1) The document presents a methodology for generating domain-based common word lists using word embeddings. It trains word vectors on biomedical journal abstracts and identifies words closest to the centroid as common words.
2) Applying the domain-based word list to a random forest classifier on the journal data improved accuracy from 49% to 53%.
3) Topic modeling before and after removing the domain words showed improved separation of topics in the final representation.
2. Domain-Based Common Words List Using High Dimensional
Representation of Text
3. Outline
• Introduction
• Domain-based common words
• Text vectorization
• Domain-based common words using Text Vectors Bundle
• Continuous Bag of Words (CBOW)
• Domain-based common words methodology
• Case study
4. Introduction
• Text cleaning is a crucial technique in data preprocessing.
• Stop word removal is a crucial text cleaning technique: it saves a huge amount of
space in text indexing and removes meaningless words from the corpus.
• Stopwords, also called noise words, are non-predictive and non-discriminating
words. They carry low information content and lower prediction quality.
• A standard stopword list is used to remove words that carry low information
content without affecting the results of retrieval and categorization tasks.
• A standard stopword list is applied to any dataset regardless of the domain.
5. Domain-Based Common Words
• Domain-based common words are defined as a set of words that have no
discriminating value within a particular domain or context.
• These common words differ from one domain to another and have no
discriminating value within their particular domain.
• For example, the term "protein" would be a stop word in a collection of articles on
biomedical data, but not in a collection describing political events.
• Eliminating these words reduces the size of the corpus and enhances the
performance of text mining.
6. Text Vectorization
• Text Vectorization is a method of encoding contextual meaning as a vector
(ordered set) of numbers.
• The meaning of a word is closely associated with the distribution of the words that
surround it in coherent text.
• Vectors are coordinates in space (a 3D vector [x, y, z] can represent a coordinate in
3D space). An n-dimensional vector can represent a coordinate in n-dimensional
space.
• Text vectors typically have between 20 and 1,000 dimensions.
7. Domain-Based Common Words Using Text Vectors Bundle
• HPCC Systems ML bundle
• Implements the Continuous Bag of Words model
• Turns text (words, phrases, and sentences) into numerical data
• Encodes the “meaning” of the text
• The Text Vectors Bundle finds the coordinates for each word so that it is
close to all words with similar meaning and distant from all words with
dissimilar meaning
• Vectors can be used as features for any ML algorithm
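The Text Vectors Bundle itself is written in ECL for HPCC Systems. As a hedged illustration of the same idea in Python, the sketch below uses gensim's Word2Vec as a stand-in (not the bundle's API); the toy corpus and parameters are illustrative assumptions:

```python
# Minimal sketch using gensim's Word2Vec as a stand-in for the
# HPCC Systems Text Vectors Bundle (which is written in ECL).
from gensim.models import Word2Vec

# Illustrative toy corpus: one tokenized sentence per list.
corpus = [
    ["protein", "binding", "activity", "was", "measured"],
    ["the", "enzyme", "showed", "high", "binding", "activity"],
]

# sg=0 selects CBOW; vector_size is the embedding dimensionality.
model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 min_count=1, sg=0, epochs=50)

# Words with similar meaning end up close together in the space,
# so their vectors can be used as features for any ML algorithm.
print(model.wv.most_similar("binding", topn=3))
```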
8. Domain-Based Common Words Using High Dimensional
Representation of Words
Analyses of both vector similarity and multidimensional scaling demonstrate that
there is significant semantic information carried in the vectors.
Project hypothesis:
Candidate domain-based words are those words with the shortest distance to the
centroid of the corpus, which is computed by averaging the word embeddings.
9. Continuous Bag of Words (CBOW)
Continuous Bag of Words (CBOW) predicts
the current target word (the center word)
based on the context words (surrounding
words).
the quick brown fox jumps over the lazy
dog.
CBOW uses the sequence of words
“quick”, “brown”, “jumps”, “over” to predict
the central word “fox”.
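As a minimal illustration (plain Python, the variable names are mine), this is how the (context, target) pairs that CBOW trains on can be extracted from the example sentence:

```python
# Minimal sketch: extract (context, target) pairs for CBOW training.
# window_radius=2 gives the 5-word window from the slide (2 + center + 2).
sentence = "the quick brown fox jumps over the lazy dog".split()
window_radius = 2

for i, target in enumerate(sentence):
    context = (sentence[max(0, i - window_radius):i]
               + sentence[i + 1:i + 1 + window_radius])
    print(context, "->", target)

# For i = 3 this prints ['quick', 'brown', 'jumps', 'over'] -> fox,
# matching the slide's example.
```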
10. Word Embedding Methodology
[Diagram: words Word_1, Word_2, Word_3, …, Word_n scattered around their
centroid, with the distance dis(Center, Word_1) marked. The centroid is the
average of the word embeddings:]

$$\text{Center} = \frac{1}{n}\sum_{i=1}^{n}\text{WordEmbedding}_i$$
11. Our Word Embedding Methodology in Detail
1. Use the Text Vectors Bundle to produce a numeric vector for each unique
token in the corpus.
2. Find the centroid of the corpus by averaging the word embeddings.
3. Use a distance metric (Euclidean distance or cosine similarity) to find the
distance between each unique word in the corpus and the centroid.
4. Sort the distances in ascending order.
5. Pick the N words with the shortest distance to be the domain-based
common words (see the sketch after this list).
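A minimal NumPy sketch of steps 2 through 5 (illustrative code of mine, assuming `embeddings` maps each unique token to its vector from step 1):

```python
# Minimal sketch of steps 2-5: centroid, distances, sort, pick top N.
# Assumes `embeddings` maps each unique token to its vector (step 1).
import numpy as np

def domain_common_words(embeddings, n_words):
    tokens = list(embeddings)
    vectors = np.array([embeddings[t] for t in tokens])

    # Step 2: the centroid is the average of all word embeddings.
    centroid = vectors.mean(axis=0)

    # Step 3: Euclidean distance from each word to the centroid
    # (cosine distance could be substituted here).
    distances = np.linalg.norm(vectors - centroid, axis=1)

    # Steps 4-5: sort ascending and keep the N closest words.
    order = np.argsort(distances)
    return [(tokens[i], float(distances[i])) for i in order[:n_words]]
```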
12. Domain-Based Common Words Steps
Corpus → Tokenize → Convert to lower case and remove special characters →
Apply Text Vectors on the corpus → Represent each token by an n-dimensional
vector → Find the centroid of the corpus → Find the distance between the centroid
and each token in the corpus → Sort the distances in ascending order → Pick the N
words with the shortest distance to be the domain-based common words
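A hedged sketch of the first cleaning stages of this pipeline (the regex and whitespace tokenizer are my assumptions, not the bundle's behavior):

```python
# Minimal sketch of the tokenize / lower-case / remove-special-character
# stages of the pipeline.
import re

def preprocess(document):
    # Convert to lower case, then replace special characters with spaces,
    # keeping only letters, digits, and whitespace.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", document.lower())
    return cleaned.split()

print(preprocess("Protein-binding activity (p < 0.05) was measured."))
# -> ['protein', 'binding', 'activity', 'p', '0', '05', 'was', 'measured']
```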
13. Case Study
• We use the PubMed dataset because it is publicly available from the National
Library of Medicine (NLM) and is a standard research dataset.
• The PubMed dataset contains more than 26 million citations for biomedical
literature from MEDLINE, life science journals, and online books.
• We used 92,349 abstracts from 2 journals:
• Journal of Bacteriology
• Biochemical Journal
14. Apply Text Vectors Bundle on Abstracts
Id | Text | Vector
1 | a | -0.0216221963, -0.1349225021, -0.1047414744, …
2 | activity | 0.1657295152, -0.086799950, -0.12469470462, …
3 | and | 0.0509911521, 0.0612892603, 0.09905680612, …
4 | aromatic | 0.093287980, 0.05376208412, 0.11822750605, …
5 | binding | 0.159114455, 0.15802078223, -0.1421487348, …
15. Top 24 Domain-Based Words After Sorting
Word | Distance
or | 0.943816216
not | 0.943816298
This | 0.943816554
were | 0.943817109
of | 0.943817221
and | 0.943817346
the | 0.943817399
an | 0.943817474
that | 0.943817490
was | 0.943817627
from | 0.9438176655
patients | 0.9438197143
16. Top 24 Domain-Based Words After Sorting (continued)
Word | Distance
both | 0.943820396
these | 0.943820566
be | 0.943820572
cell | 0.943820632
treatment | 0.943820712
are | 0.943820851
which | 0.943821897
other | 0.943822459
found | 0.943823947
when | 0.943824136
level | 0.943824166
protein | 0.9438197143
17. Testing Domain-Based Common Words Methodology
• We used a Random Forest classifier to test our methodology before and after
eliminating a set of domain-based words.
• A Random Forest can handle large numbers of records and fields, and works well
with the default parameters. It also scales well on HPCC Systems clusters of
almost any size.
• We used sentence vectors as input features to the Random Forest, with about 20%
of the data reserved for testing.
• We converted our data to the form used by the ML bundles.
• We separated the independent variables and the dependent variables (a sketch of
the evaluation follows).
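A minimal sketch of this evaluation in Python, with scikit-learn's RandomForestClassifier standing in for the HPCC Systems ML bundle; `X` (sentence vectors) and `y` (journal labels) are assumed to already exist:

```python
# Minimal evaluation sketch, with scikit-learn standing in for the
# HPCC Systems ML bundle. Assumes X holds the sentence vectors and
# y holds the journal label for each abstract.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Reserve about 20% of the data for testing, as on the slide.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Default parameters, which the slide notes work well for Random Forests.
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))  # per-class precision/recall
```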
18. Random Forest Accuracy After Eliminating Various Numbers of
Domain-Based Common Words
[Chart: Random Forest accuracy (y-axis, 0.47 to 0.54) versus the number of stop
words eliminated (x-axis: 0, 300, 500, 1000, 2500, 3500, 5000, 7000, 9000, 10000,
11000).]
19. Precision and Recall Before and After Eliminating 5000
Domain-Based Common Words:
Before:
Class | Precision | Recall
0 | 0.509 | 0.549
1 | 0.486 | 0.435

After:
Class | Precision | Recall
0 | 0.511 | 0.570
1 | 0.499 | 0.447
20. Testing Using Topic Modeling
Topic modeling identifies the topics in a set of documents. We use Latent Dirichlet
Allocation (LDA), a topic modeling technique that identifies the distribution of topics
in a set of documents (a sketch follows).
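A minimal sketch of an LDA run in Python with gensim (an illustrative library choice, not the tooling the slides used; `docs` is assumed to hold the tokenized abstracts):

```python
# Minimal LDA sketch with gensim (illustrative library choice).
# Assumes `docs` is a list of tokenized abstracts.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# Two topics, matching the two journals in the case study.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
               num_topics=2, passes=10, random_state=42)

# Print the top words per topic; running this before and after removing
# the domain-based common words lets us compare topic separation.
for topic_id, words in lda.print_topics(num_words=10):
    print(topic_id, words)
```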
21. Topic Modeling Before and After Eliminating Domain-Based Words:
Before:
Topic: ‘cells’, ‘human’, ‘alpha’, ‘protein’, ‘activity’, ‘beta’, ‘10’, ‘expression’, ‘dna’, ‘dose’
Topic: ‘pressure’, ‘blood’, ‘patients’, ‘group’, ‘flow’, ‘activity’, ‘subjects’, ‘significant’, ‘study’,
‘patient’, ‘mg’, ‘treatment’, ‘cases’, ‘disease’, ‘activity’, ‘clinical’, ‘significant’, ‘left’, ‘kg’

After:
Topic: ‘oxygen’, ‘test’, ‘surgery’, ‘weight’, ‘resistance’, ‘coronary’, ‘mice’, ‘cardiac’, ‘ventricular’
Topic: ‘skin’, ‘cancer’, ‘ill’, ‘hormone’, ‘difference’, ‘positive’, ‘channel’, ‘temperature’, ‘receptor’,
‘virus’, ‘dental’, ‘sites’, ‘model’, ‘methods’, ‘infected’, ‘research’, ‘joint’, ‘children’, ‘phase’
22. Conclusion
• Developed a new method to automatically generate a domain-based common
word list using a high dimensional representation of text.
• Used the Text Vectors Bundle to extract the domain-based words with the shortest
distance to the centroid of the corpus, computed by averaging the word
embeddings.
• Increased the accuracy of a Random Forest classifier after eliminating a set of
domain-based words.
• Our methodology is the first to utilize a high dimensional representation of words
to find domain-based words.
23. Future Work
• Test the domain-based methodology using other journals, such as:
• Cell Journal
• Dentist Journal
• Plant Journal
• Environmental Health Perspectives Journal
• The Journal of Biological Chemistry
• Biochemical and Biophysical Research Communications Journal
• Brain Research Journal
24. Future Work
• Test the domain-based methodology using other classifiers:
• Naïve Bayes classifier
• Support Vector Machine
• Logistic Regression
• Apply the domain-based methodology to texts of different lengths:
• Full journal text
• Journal abstracts
26. View this Presentation on YouTube:
https://www.youtube.com/watch?v=O-qJwxhQTzA&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=2&t=0s
(2:25:15)
Editor's notes
We are living in the era of big data. We have a lot of data from various domains and different sectors. For example, we have medical, political, industrial, and financial datasets.
In order to find trends or to discover hidden structure, we need to preprocess the data first.
Text cleaning is a crucial technique in any data mining or NLP task.
One important step in text cleaning is stop word removal, also called stop word reduction, which eliminates the noise words that are irrelevant to the context or not predictive; they carry low information content, so we do not need these words and should eliminate them.
By eliminating these words we save a huge amount of space in text indexing.
Most researchers use a standard stopword list to remove the words that carry low information content. These words are general, and the list is applied to any dataset regardless of the domain.
The input, or context word, is a one-hot encoded vector of size V.
The hidden layer neurons just copy the weighted sum of the inputs to the next layer.
If we have this sentence and a window of 5, and we want to predict the center word “fox”:
For this, our inputs will be the context words, which are passed to an embedding layer (initialized with random weights). The word embeddings are propagated to a lambda layer where we average them (hence “CBOW”: we don't really consider the order or sequence of the context words when averaging), and then we pass this averaged context embedding to a dense softmax layer, which predicts our target word.
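A minimal Keras sketch of the architecture these notes describe (embedding layer, averaging lambda layer, dense softmax); the vocabulary size and dimensions are illustrative assumptions:

```python
# Minimal sketch of the CBOW architecture from the notes:
# embedding -> average (lambda) -> dense softmax over the vocabulary.
import tensorflow as tf

vocab_size, embed_dim = 10000, 100  # illustrative sizes

model = tf.keras.Sequential([
    # Embedding layer, initialized with random weights; input is a
    # batch of context-word index sequences.
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    # Lambda layer: average the context embeddings (order is ignored,
    # hence "continuous bag of words").
    tf.keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),
    # Dense softmax layer predicts the target (center) word.
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="rmsprop")
```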