2019 HPCC Systems® Community Day
Challenge Yourself – Challenge the Status Quo
Farah Alshanik
PhD Student, CS
Clemson University
Domain-Based Common Words List Using High Dimensional Representation of Text
Outline
• Introduction
• Domain-based common words
• Text vectorization
• Domain-based common words using Text Vectors Bundle
• Continuous Bag of Words (CBOW)
• Domain-based common words methodology
• Case study
Introduction
• Text cleaning is a crucial step in data preprocessing.
• Stop word removal is a key space-saving step in text cleaning: it saves a large amount of space in text indexing and removes meaningless words from the corpus.
• Stopwords, also called noise words, are non-predictive and non-discriminating words. They carry little information content and can degrade prediction results.
• A standard stopword list is used to remove words that carry little information content and do not affect the results of retrieval and categorization tasks.
• A standard stopword list is applied to any dataset regardless of its domain.
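For reference, applying a standard, domain-agnostic stopword list takes only a few lines. This is a minimal sketch using NLTK's English list (NLTK is an illustrative assumption; the talk does not name a library). Note how a domain-common word such as "protein" survives the filter:

```python
# Minimal sketch of standard (domain-agnostic) stopword removal using NLTK.
# NLTK is an illustrative assumption; the talk does not prescribe a library.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)            # fetch the standard English list once
standard_stopwords = set(stopwords.words("english"))

tokens = "the protein binds the receptor in the cell membrane".split()
filtered = [t for t in tokens if t not in standard_stopwords]
print(filtered)  # ['protein', 'binds', 'receptor', 'cell', 'membrane'] -- 'protein' is kept
```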
Domain-Based Common Words
• Domain-based common words are defined as a set of words that have no discriminating value within a particular domain or context.
• These words differ from one domain to another and carry no value within their particular domain.
• The term "protein" would be a stop word in a collection of articles on biomedical data, but not in a collection describing political events.
• Eliminating these words reduces the size of the corpus and improves the performance of text mining.
Text Vectorization
• Text Vectorization is a method of encoding contextual meaning as a vector
(ordered set) of numbers.
• The meaning of a word is closely associated with the distribution of the words that
surround it in coherent text.
• Vectors are coordinates in space (a 3D vector [x, y, z] can represent a coordinate in 3D space); an n-dimensional vector can represent a coordinate in n-dimensional space.
• Text vectors are typically between 20 and 1000 dimensions.
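To make the idea of words as coordinates concrete, here is a tiny illustrative sketch (made-up 3-dimensional vectors; real text vectors have tens to hundreds of dimensions) of the cosine similarity commonly used to compare them:

```python
# Illustrative only: toy 3-dimensional "word vectors" and cosine similarity.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

protein  = np.array([0.16, 0.15, -0.14])   # made-up coordinates
enzyme   = np.array([0.15, 0.14, -0.12])
politics = np.array([-0.90, 0.10, 0.40])

print(cosine_similarity(protein, enzyme))    # high: similar "meaning"
print(cosine_similarity(protein, politics))  # low: dissimilar "meaning"
```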
Domain-Based Common Words Using Text Vectors Bundle
• HPCC Systems ML bundle
• Implements the Continuous Bag of Words (CBOW) model
• Turns text into numerical data (words, phrases and sentences)
• Encodes the “meaning” of the text
• The Text Vectors Bundle finds the coordinates for each word so that it is
close to all words with similar meaning and distant from all words with
dissimilar meaning
• Vectors can be used as features for any ML algorithm
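The Text Vectors Bundle itself is an HPCC Systems ECL bundle. As a rough, non-ECL analogy of the same idea (training CBOW embeddings over a tokenized corpus), here is a hedged sketch using gensim's Word2Vec, assuming gensim 4.x; it is not the bundle's actual API:

```python
# Rough analogy only: CBOW word embeddings via gensim, NOT the HPCC Systems
# Text Vectors Bundle API. Assumes gensim 4.x (vector_size / sg parameters).
from gensim.models import Word2Vec

# Each document is a list of lower-cased tokens (toy corpus for illustration).
corpus = [
    ["the", "protein", "binds", "the", "receptor"],
    ["enzyme", "activity", "was", "measured", "in", "the", "cell"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context words on each side of the target
    sg=0,              # sg=0 selects CBOW (sg=1 would be skip-gram)
    min_count=1,
)

vec = model.wv["protein"]                         # 100-dimensional numpy vector
print(vec.shape)                                  # (100,)
print(model.wv.most_similar("protein", topn=3))   # nearest words in vector space
```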
Domain-Based Common Words Using High Dimensional Representation of Words
Analyses of both vector similarity and multidimensional scaling demonstrate that the vectors carry significant semantic information.
Project hypothesis:
Candidate domain-based words are the words with the shortest distance from the centroid of the corpus, where the centroid is computed by averaging the word embeddings.
Continuous Bag of Words (CBOW)
Continuous Bag of Words (CBOW) predicts the current target word (the center word) from the context words (the surrounding words).
Example sentence: "the quick brown fox jumps over the lazy dog."
CBOW uses the context words "quick", "brown", "jumps", "over" to predict the central word "fox".
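A minimal sketch of how CBOW forms its (context, target) training pairs from this sentence, assuming a symmetric window of two words on each side:

```python
# Build (context, target) pairs as used by CBOW, for the example sentence.
# A symmetric window of 2 words on each side is assumed for illustration.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

def cbow_pairs(tokens, window):
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target

for context, target in cbow_pairs(sentence, window):
    if target == "fox":
        print(context, "->", target)
        # ['quick', 'brown', 'jumps', 'over'] -> fox
```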
Word Embedding Methodology
(Diagram: word vectors Word_1, Word_2, Word_3, …, Word_n scattered around the corpus centroid, Center = (Σᵢ₌₁ⁿ Word Embeddingᵢ) / n, with the distance dis(Center, Word_1) drawn from the centroid to a word.)
Our Word Embedding Methodology in Detail
1. Use the Text Vectors Bundle to produce a numeric vector for each unique token in the corpus.
2. Find the centroid of the corpus by averaging the word embeddings.
3. Use a distance metric (Euclidean distance or cosine similarity) to find the distance between each unique word in the corpus and the centroid.
4. Sort the distances in ascending order.
5. Pick the N words with the shortest distance to be the domain-based common words (sketched in code below).
(Diagram: the same centroid-and-distances illustration as on the previous slide.)
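A compact sketch of steps 2 through 5 above; the `embeddings` dictionary stands in for the per-word vectors that the Text Vectors Bundle would produce, and Euclidean distance is assumed (cosine distance could be swapped in at the marked line):

```python
# Sketch of the centroid/distance/sort steps. `embeddings` stands in for the
# per-word vectors the Text Vectors Bundle would produce (hypothetical data).
import numpy as np

def domain_common_words(embeddings: dict[str, np.ndarray], top_n: int) -> list[str]:
    words = list(embeddings)
    matrix = np.stack([embeddings[w] for w in words])      # shape (vocab, dims)
    centroid = matrix.mean(axis=0)                         # average of all word vectors
    distances = np.linalg.norm(matrix - centroid, axis=1)  # Euclidean (cosine could be used instead)
    order = np.argsort(distances)                          # ascending: closest first
    return [words[i] for i in order[:top_n]]               # N closest = candidate common words

# Example with random placeholder vectors (illustration only).
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50) for w in ["the", "of", "protein", "kinase", "ribosome"]}
print(domain_common_words(embeddings, top_n=2))
```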
Domain-Based Common Words Steps
Corpus → Tokenize → Convert to lower case and remove special characters → Apply Text Vectors on corpus → Represent each token by an n-dimensional vector → Find centroid of corpus → Find distance between centroid and tokens in corpus → Sort distances in ascending order → Pick N words with shortest distance to be domain-based words
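The first stages of this pipeline (lower-casing, removing special characters, tokenizing) can be sketched as follows; the regular expression is an illustrative assumption rather than the project's exact cleaning rule:

```python
# Sketch of the corpus preprocessing stage: lower-case, strip special
# characters, tokenize. The exact regex is an illustrative assumption.
import re

def preprocess(document: str) -> list[str]:
    document = document.lower()
    document = re.sub(r"[^a-z0-9\s]", " ", document)  # drop special characters
    return document.split()                           # whitespace tokenization

print(preprocess("Protein-binding activity (p < 0.05) was measured."))
# ['protein', 'binding', 'activity', 'p', '0', '05', 'was', 'measured']
```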
Case Study
• We use the PubMed dataset because it is publicly available from the National Library of Medicine (NLM) and is a standard research dataset.
• The PubMed dataset contains more than 26 million citations for biomedical
literature from MEDLINE, life science journals, and online books.
• We used 92,349 abstracts from 2 journals:
• Journal of Bacteriology
• Biochemical Journal
Apply Text Vectors Bundle on Abstracts
Id | Text | Vector
1 | a | -0.0216221963, -0.1349225021, -0.1047414744, …
2 | activity | 0.1657295152, -0.086799950, -0.12469470462, …
3 | and | 0.0509911521, 0.0612892603, 0.09905680612, …
4 | aromatic | 0.093287980, 0.05376208412, 0.11822750605, …
5 | binding | 0.159114455, 0.15802078223, -0.1421487348, …
Top 24 Domain-Based Words After Sorting (1 of 2)
Word | Distance
or | 0.943816216
not | 0.943816298
This | 0.943816554
were | 0.943817109
of | 0.943817221
and | 0.943817346
the | 0.943817399
an | 0.943817474
that | 0.943817490
was | 0.943817627
from | 0.9438176655
patients | 0.9438197143
Top 24 Domain-Based Words After Sorting (2 of 2)
Word | Distance
both | 0.943820396
these | 0.943820566
be | 0.943820572
cell | 0.943820632
treatment | 0.943820712
are | 0.943820851
which | 0.943821897
other | 0.943822459
found | 0.943823947
when | 0.943824136
level | 0.943824166
protein | 0.9438197143
Testing Domain-Based Common Words Methodology
• We used a Random Forest classifier to test our methodology before and after eliminating a set of domain-based words.
• A Random Forest can handle large numbers of records and fields, works well with the default parameters, and scales well on HPCC Systems clusters of almost any size.
• We used sentence vectors as input features to the Random Forest, with about 20% of the data reserved for testing.
• We converted our data to the form used by the ML bundles.
• We separated the independent variables and the dependent variables.
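A hedged sketch of this evaluation setup, written with scikit-learn as a stand-in for the HPCC Systems ML bundles used in the actual work; `X` holds the sentence vectors (independent variables) and `y` the journal label (dependent variable), both filled by random placeholders here:

```python
# Evaluation sketch with scikit-learn as a stand-in for the HPCC Systems ML
# bundles. X = sentence vectors (independents), y = journal label (dependent).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 100))     # placeholder sentence vectors
y = rng.integers(0, 2, size=1000)    # placeholder class labels (two journals)

# Reserve about 20% of the data for testing, as in the talk.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier()     # default parameters work reasonably well
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```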
Random Forest Accuracy After Eliminating Various Numbers of Domain-Based Common Words
(Chart: classification accuracy (y-axis, roughly 0.47 to 0.54) versus the number of stop words eliminated (x-axis: 0, 300, 500, 1000, 2500, 3500, 5000, 7000, 9000, 10000, 11000).)
Precision and Recall Before and After Eliminating 5000 Domain-Based Common Words:
Before:
Class | Precision | Recall
0 | 0.509 | 0.549
1 | 0.486 | 0.435

After:
Class | Precision | Recall
0 | 0.511 | 0.570
1 | 0.499 | 0.447
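Per-class precision and recall tables like the ones above can be produced with scikit-learn's classification_report; again a stand-in sketch with toy labels, not the HPCC Systems tooling:

```python
# Per-class precision/recall sketch (scikit-learn stand-in, toy labels).
from sklearn.metrics import classification_report

y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]
print(classification_report(y_true, y_pred, digits=3))
```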
Testing Using Topic Modeling
Topic modeling identifies the topics in a set of documents. We use Latent Dirichlet Allocation (LDA), a topic modeling technique, to identify the distribution of topics in a set of documents (a sketch follows below).
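A minimal LDA sketch using scikit-learn (an illustrative choice; the talk does not specify which LDA implementation was used, and scikit-learn 1.0 or later is assumed for get_feature_names_out):

```python
# LDA topic-modeling sketch using scikit-learn (illustrative choice of library).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "protein binding activity in the cell",
    "blood pressure treatment in patients",
    "dna expression in human cells",
]

vectorizer = CountVectorizer()          # bag-of-words counts
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top words per topic.
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {k}:", top)
```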
Topic Modeling Before and After Eliminating Domain-Based Words:
Before:
Topic: 'cells', 'human', 'alpha', 'protein', 'activity', 'beta', '10', 'expression', 'dna', 'dose'
Topic: 'pressure', 'blood', 'patients', 'group', 'flow', 'activity', 'subjects', 'significant', 'study', 'patient', 'mg', 'treatment', 'cases', 'disease', 'activity', 'clinical', 'significant', 'left', 'kg'

After:
Topic: 'oxygen', 'test', 'surgery', 'weight', 'resistance', 'coronary', 'mice', 'cardiac', 'ventricular'
Topic: 'skin', 'cancer', 'ill', 'hormone', 'difference', 'positive', 'channel', 'temperature', 'receptor', 'virus', 'dental', 'sites', 'model', 'methods', 'infected', 'research', 'joint', 'children', 'phase'
Conclusion
• We developed a new method to automatically generate a domain-based common words list using a high dimensional representation of text.
• We use the Text Vectors Bundle to extract the domain-based words with the shortest distance from the centroid of the corpus, where the centroid is obtained by averaging the word embeddings.
• Eliminating a set of domain-based words increased the accuracy of a Random Forest classifier.
• Our methodology is the first to utilize a high dimensional representation of words to find domain-based words.
Future Work
• Test the domain-based methodology using other journals, such as:
• Cell Journal
• Dentist Journal
• Plant Journal
• Environmental Health Perspectives Journal
• The Journal of Biological Chemistry
• Biochemical and Biophysical Research Communications Journal
• Brain Research Journal
Future Work
• Test the domain-based methodology using other classifiers:
• Naïve Bayes classifier
• Support Vector Machine
• Logistic Regression
• Apply the domain-based methodology to texts of different lengths:
• Full text of journal articles
• Journal abstracts
Questions?
Farah Alshanik
Clemson University
PhD Student
falshan@g.Clemson.edu
View this Presentation on YouTube:
https://www.youtube.com/watch?v=O-qJwxhQTzA&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=2&t=0s
(2:25:15)
Editor's Notes
1. We are living in the era of big data. We have a lot of data from various domains and sectors, for example medical, political, industrial, and financial datasets. In order to find trends or discover hidden structure, we need to preprocess the data first. Text cleaning is a crucial technique in any data mining or NLP task. One important step in text cleaning is stop word removal, also called stop word reduction, which eliminates noise words that are irrelevant to the context or not predictive because they carry low information content; we do not need these words, so we eliminate them. By eliminating them we save a huge amount of space in text indexing. Most researchers use a standard stopword list to remove the words that carry low information content; these words are general, and the list is applied to any dataset regardless of the domain.
2. The input, or the context word, is a one-hot encoded vector of size V. The hidden layer neurons simply copy the weighted sum of the inputs to the next layer. If we have this sentence, a window of 5, and we want to predict the center word "fox", our inputs are the context words, which are passed to an embedding layer (initialized with random weights). The word embeddings are propagated to a lambda layer, where we average them (hence "CBOW", because we do not really consider the order or sequence of the context words when averaging), and then we pass this averaged context embedding to a dense softmax layer, which predicts our target word.
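The second note describes the CBOW network end to end. A minimal, hedged Keras sketch of that architecture follows; the vocabulary size, embedding dimension, and context size are placeholder assumptions, not values from the project:

```python
# Minimal Keras sketch of the CBOW architecture described in the note:
# context word ids -> embedding layer -> average (Lambda) -> dense softmax.
# Vocabulary size, embedding dimension, and context size are assumptions.
import tensorflow as tf

vocab_size = 10000   # V: size of the one-hot vocabulary
embed_dim = 100      # dimensionality of the learned word vectors
context_size = 4     # e.g. a window of 2 words on each side

context = tf.keras.Input(shape=(context_size,), dtype="int32")
embedded = tf.keras.layers.Embedding(vocab_size, embed_dim)(context)              # (batch, 4, 100)
averaged = tf.keras.layers.Lambda(lambda t: tf.reduce_mean(t, axis=1))(embedded)  # average the context embeddings
output = tf.keras.layers.Dense(vocab_size, activation="softmax")(averaged)        # predict the target word

cbow = tf.keras.Model(context, output)
cbow.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
cbow.summary()
```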