1) The document presents a methodology for generating domain-based common word lists using word embeddings. It trains word vectors on biomedical journal abstracts and identifies words closest to the centroid as common words.
2) Applying the domain-based word list to a random forest classifier on the journal data improved accuracy from 49% to 53%.
3) Topic modeling before and after removing the domain words showed improved separation of topics in the final representation.
2. Domain-Based Common Words List Using High Dimensional
Representation of Text
3. Outline
• Introduction
• Domain-based common words
• Text vectorization
• Domain-based common words using Text Vectors Bundle
• Continuous Bag of Words (CBOW)
• Domain-based common words methodology
• Case study
4. Introduction
• Text cleaning is a crucial technique in data preprocessing.
• Stop word removal is a crucial text cleaning technique: it saves a huge amount of
space in text indexing and removes meaningless words from the corpus.
• Stopwords, also called noise words, are non-predictive and non-discriminating
words. They carry low information content and lower prediction quality.
• A standard stopword list is used to remove words that carry low information
content without affecting the results of retrieval and categorization tasks.
• A standard stopword list is applied to any dataset regardless of the domain.
5. Domain-Based Common Words
• Domain-based common words are defined as a set of words that have no
discriminating value within a particular domain or context.
• These common words differ from one domain to another and have no
discriminating value within their particular domain.
• For example, the term "protein" would be a stop word in a collection of articles on
biomedical data, but not in a collection describing political events.
• Eliminating these words reduces the size of the corpus and enhances the
performance of text mining.
6. Text Vectorization
• Text Vectorization is a method of encoding contextual meaning as a vector
(ordered set) of numbers.
• The meaning of a word is closely associated with the distribution of the words that
surround it in coherent text.
• Vectors are coordinates in space (a 3D vector [x, y, z] can represent a coordinate in
3D space). An n-dimensional vector can represent a coordinate in n-dimensional
space.
• Text vectors typically have between 20 and 1,000 dimensions.
7. Domain-Based Common Words Using Text Vectors Bundle
• HPCC Systems ML bundle
• Implements the Continuous Bag of Words model
• Turns text (words, phrases, and sentences) into numerical data
• Encodes the “meaning” of the text
• The Text Vectors Bundle finds the coordinates for each word so that it is
close to all words with similar meaning and distant from all words with
dissimilar meaning
• Vectors can be used as features for any ML algorithm
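The Text Vectors Bundle itself is written in ECL for HPCC Systems. As a hedged illustration of the same idea in Python, the sketch below uses gensim's Word2Vec as a stand-in (not the bundle's API); the toy corpus and parameters are illustrative assumptions:

```python
# Minimal sketch using gensim's Word2Vec as a stand-in for the
# HPCC Systems Text Vectors Bundle (which is written in ECL).
from gensim.models import Word2Vec

# Illustrative toy corpus: one tokenized sentence per list.
corpus = [
    ["protein", "binding", "activity", "was", "measured"],
    ["the", "enzyme", "showed", "high", "binding", "activity"],
]

# sg=0 selects CBOW; vector_size is the embedding dimensionality.
model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 min_count=1, sg=0, epochs=50)

# Words with similar meaning end up close together in the space,
# so their vectors can be used as features for any ML algorithm.
print(model.wv.most_similar("binding", topn=3))
```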
8. Domain-Based Common Words Using High Dimensional
Representation of Words
Analyses of both vector similarity and multidimensional scaling demonstrate that
there is significant semantic information carried in the vectors.
Project hypothesis:
Candidate domain-based words are those words with the shortest distance to the
centroid of the corpus, which is computed by averaging the word embeddings.
9. Continuous Bag of Words (CBOW)
Continuous Bag of Words (CBOW) predicts
the current target word (the center word)
based on the context words (surrounding
words).
the quick brown fox jumps over the lazy
dog.
CBOW uses the sequence of words
“quick”, “brown”, “jumps”, “over” to predict
the central word “fox”.
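As a minimal illustration (plain Python, the variable names are mine), this is how the (context, target) pairs that CBOW trains on can be extracted from the example sentence:

```python
# Minimal sketch: extract (context, target) pairs for CBOW training.
# window_radius=2 gives the 5-word window from the slide (2 + center + 2).
sentence = "the quick brown fox jumps over the lazy dog".split()
window_radius = 2

for i, target in enumerate(sentence):
    context = (sentence[max(0, i - window_radius):i]
               + sentence[i + 1:i + 1 + window_radius])
    print(context, "->", target)

# For i = 3 this prints ['quick', 'brown', 'jumps', 'over'] -> fox,
# matching the slide's example.
```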
10. Word Embedding Methodology
[Diagram: words Word_1, Word_2, Word_3, …, Word_n scattered around their
centroid, with the distance dis(Center, Word_1) marked. The centroid is the
average of the word embeddings:]

$$\text{Center} = \frac{1}{n}\sum_{i=1}^{n}\text{WordEmbedding}_i$$
11. Our Word Embedding Methodology in Detail
1. Use the Text Vectors Bundle to produce a numeric vector for each unique
token in the corpus.
2. Find the centroid of the corpus by averaging the word embeddings.
3. Use a distance metric (Euclidean distance or cosine similarity) to find the
distance between each unique word in the corpus and the centroid.
4. Sort the distances in ascending order.
5. Pick the N words with the shortest distance to be the domain-based
common words (see the sketch after this list).
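A minimal NumPy sketch of steps 2 through 5 (illustrative code of mine, assuming `embeddings` maps each unique token to its vector from step 1):

```python
# Minimal sketch of steps 2-5: centroid, distances, sort, pick top N.
# Assumes `embeddings` maps each unique token to its vector (step 1).
import numpy as np

def domain_common_words(embeddings, n_words):
    tokens = list(embeddings)
    vectors = np.array([embeddings[t] for t in tokens])

    # Step 2: the centroid is the average of all word embeddings.
    centroid = vectors.mean(axis=0)

    # Step 3: Euclidean distance from each word to the centroid
    # (cosine distance could be substituted here).
    distances = np.linalg.norm(vectors - centroid, axis=1)

    # Steps 4-5: sort ascending and keep the N closest words.
    order = np.argsort(distances)
    return [(tokens[i], float(distances[i])) for i in order[:n_words]]
```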
12. Domain-Based Common Words Steps
Corpus → Tokenize → Convert to lower case and remove special characters →
Apply Text Vectors on the corpus → Represent each token by an n-dimensional
vector → Find the centroid of the corpus → Find the distance between the centroid
and each token in the corpus → Sort the distances in ascending order → Pick the N
words with the shortest distance to be the domain-based common words
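A hedged sketch of the first cleaning stages of this pipeline (the regex and whitespace tokenizer are my assumptions, not the bundle's behavior):

```python
# Minimal sketch of the tokenize / lower-case / remove-special-character
# stages of the pipeline.
import re

def preprocess(document):
    # Convert to lower case, then replace special characters with spaces,
    # keeping only letters, digits, and whitespace.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", document.lower())
    return cleaned.split()

print(preprocess("Protein-binding activity (p < 0.05) was measured."))
# -> ['protein', 'binding', 'activity', 'p', '0', '05', 'was', 'measured']
```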
13. Case Study
• We use the PubMed dataset because it is publicly available from the National
Library of Medicine (NLM) and is a standard research dataset.
• The PubMed dataset contains more than 26 million citations for biomedical
literature from MEDLINE, life science journals, and online books.
• We used 92,349 abstracts from 2 journals:
• Journal of Bacteriology
• Biochemical Journal
14. Apply Text Vectors Bundle on Abstracts
Id | Text | Vector
1 | a | -0.0216221963, -0.1349225021, -0.1047414744, …
2 | activity | 0.1657295152, -0.086799950, -0.12469470462, …
3 | and | 0.0509911521, 0.0612892603, 0.09905680612, …
4 | aromatic | 0.093287980, 0.05376208412, 0.11822750605, …
5 | binding | 0.159114455, 0.15802078223, -0.1421487348, …
15. Top 24 Domain-Based Words After Sorting
Word | Distance
or | 0.943816216
not | 0.943816298
This | 0.943816554
were | 0.943817109
of | 0.943817221
and | 0.943817346
the | 0.943817399
an | 0.943817474
that | 0.943817490
was | 0.943817627
from | 0.9438176655
patients | 0.9438197143
16. Top 24 Domain-Based Words After Sorting (continued)
Word | Distance
both | 0.943820396
these | 0.943820566
be | 0.943820572
cell | 0.943820632
treatment | 0.943820712
are | 0.943820851
which | 0.943821897
other | 0.943822459
found | 0.943823947
when | 0.943824136
level | 0.943824166
protein | 0.9438197143
17. Testing Domain-Based Common Words Methodology
• We used a Random Forest classifier to test our methodology before and after
eliminating a set of domain-based words.
• A Random Forest can handle large numbers of records and fields, and works well
with the default parameters. It also scales well on HPCC Systems clusters of
almost any size.
• We used sentence vectors as input features to the Random Forest, with about 20%
of the data reserved for testing.
• We converted our data to the form used by the ML bundles.
• We separated the independent variables and the dependent variables (a sketch of
the evaluation follows).
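A minimal sketch of this evaluation in Python, with scikit-learn's RandomForestClassifier standing in for the HPCC Systems ML bundle; `X` (sentence vectors) and `y` (journal labels) are assumed to already exist:

```python
# Minimal evaluation sketch, with scikit-learn standing in for the
# HPCC Systems ML bundle. Assumes X holds the sentence vectors and
# y holds the journal label for each abstract.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Reserve about 20% of the data for testing, as on the slide.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Default parameters, which the slide notes work well for Random Forests.
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))  # per-class precision/recall
```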
18. Random Forest Accuracy After Eliminating Various Numbers of
Domain-Based Common Words
[Chart: Random Forest accuracy (y-axis, 0.47 to 0.54) versus the number of stop
words eliminated (x-axis: 0, 300, 500, 1000, 2500, 3500, 5000, 7000, 9000, 10000,
11000).]
19. Precision and Recall Before and After Eliminating 5000
Domain-Based Common Words:
Before:
Class | Precision | Recall
0 | 0.509 | 0.549
1 | 0.486 | 0.435

After:
Class | Precision | Recall
0 | 0.511 | 0.570
1 | 0.499 | 0.447
20. Testing Using Topic Modeling
Topic modeling identifies the topics in a set of documents. We use Latent Dirichlet
Allocation (LDA), a topic modeling technique that identifies the distribution of topics
in a set of documents (a sketch follows).
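A minimal sketch of an LDA run in Python with gensim (an illustrative library choice, not the tooling the slides used; `docs` is assumed to hold the tokenized abstracts):

```python
# Minimal LDA sketch with gensim (illustrative library choice).
# Assumes `docs` is a list of tokenized abstracts.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# Two topics, matching the two journals in the case study.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
               num_topics=2, passes=10, random_state=42)

# Print the top words per topic; running this before and after removing
# the domain-based common words lets us compare topic separation.
for topic_id, words in lda.print_topics(num_words=10):
    print(topic_id, words)
```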
21. Topic Modeling Before and After Eliminating Domain-Based Words:
Before:
Topic: ‘cells’, ‘human’, ‘alpha’, ‘protein’, ‘activity’, ‘beta’, ‘10’, ‘expression’, ‘dna’, ‘dose’
Topic: ‘pressure’, ‘blood’, ‘patients’, ‘group’, ‘flow’, ‘activity’, ‘subjects’, ‘significant’, ‘study’,
‘patient’, ‘mg’, ‘treatment’, ‘cases’, ‘disease’, ‘activity’, ‘clinical’, ‘significant’, ‘left’, ‘kg’

After:
Topic: ‘oxygen’, ‘test’, ‘surgery’, ‘weight’, ‘resistance’, ‘coronary’, ‘mice’, ‘cardiac’, ‘ventricular’
Topic: ‘skin’, ‘cancer’, ‘ill’, ‘hormone’, ‘difference’, ‘positive’, ‘channel’, ‘temperature’, ‘receptor’,
‘virus’, ‘dental’, ‘sites’, ‘model’, ‘methods’, ‘infected’, ‘research’, ‘joint’, ‘children’, ‘phase’
22. Conclusion
• Developed a new method to automatically generate a domain-based common
word list using a high dimensional representation of text.
• Used the Text Vectors Bundle to extract the domain-based words with the shortest
distance to the centroid of the corpus, computed by averaging the word
embeddings.
• Increased the accuracy of a Random Forest classifier after eliminating a set of
domain-based words.
• Our methodology is the first to utilize a high dimensional representation of words
to find domain-based words.
23. Future Work
• Test the domain-based methodology using other journals, such as:
• Cell Journal
• Dentist Journal
• Plant Journal
• Environmental Health Perspectives Journal
• The Journal of Biological Chemistry
• Biochemical and Biophysical Research Communications Journal
• Brain Research Journal
24. Future Work
• Test the domain-based methodology using other classifiers:
• Naïve Bayes classifier
• Support Vector Machine
• Logistic Regression
• Apply the domain-based methodology to texts of different lengths:
• Full journal text
• Journal abstracts
26. View this Presentation on YouTube:
https://www.youtube.com/watch?v=O-qJwxhQTzA&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=2&t=0s
(2:25:15)
Editor's notes
We are living in the era of big data. We have a lot of data from various domains and different sectors. For example, we have medical, political, industrial, and financial datasets.
In order to find trends or to discover hidden structure, we need to preprocess the data first.
Text cleaning is a crucial technique in any data mining or NLP task.
One important step in text cleaning is stop word removal, also called stop word reduction, which eliminates the noise words that are irrelevant to the context or not predictive; they carry low information content, so we do not need these words and should eliminate them.
By eliminating these words we save a huge amount of space in text indexing.
Most researchers use a standard stopword list to remove the words that carry low information content. These words are general, and the list is applied to any dataset regardless of the domain.
The input, or context word, is a one-hot encoded vector of size V.
The hidden layer neurons just copy the weighted sum of the inputs to the next layer.
If we have this sentence and a window of 5, and we want to predict the center word “fox”:
For this, our inputs will be the context words, which are passed to an embedding layer (initialized with random weights). The word embeddings are propagated to a lambda layer where we average them (hence “CBOW”: we don't really consider the order or sequence of the context words when averaging), and then we pass this averaged context embedding to a dense softmax layer, which predicts our target word.
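A minimal Keras sketch of the architecture these notes describe (embedding layer, averaging lambda layer, dense softmax); the vocabulary size and dimensions are illustrative assumptions:

```python
# Minimal sketch of the CBOW architecture from the notes:
# embedding -> average (lambda) -> dense softmax over the vocabulary.
import tensorflow as tf

vocab_size, embed_dim = 10000, 100  # illustrative sizes

model = tf.keras.Sequential([
    # Embedding layer, initialized with random weights; input is a
    # batch of context-word index sequences.
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    # Lambda layer: average the context embeddings (order is ignored,
    # hence "continuous bag of words").
    tf.keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),
    # Dense softmax layer predicts the target (center) word.
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="rmsprop")
```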