This document summarizes a presentation on using convolutional neural networks (CNNs) for concept extraction. It discusses using CNNs to automatically extract keywords or concepts from documents, representing words as word embeddings and analyzing sequences of words as n-grams. The presentation evaluates CNN performance on concept extraction compared to part-of-speech tagging approaches, finding CNNs achieve a slightly better F1 score while increasing recall at the cost of precision for longer n-grams. It concludes by discussing possible improvements like using different word embeddings, n-gram sources, or neural architectures like RNNs.
Concept extraction with convolutional neural networks
1. Concept Extraction with Convolutional Neural Networks
Andreas Waldis, Luca Mazzola, and Michael Kaufmann
HSLU - Lucerne University of Applied Sciences,
School of Information Technology,
6343 - Rotkreuz,
Switzerland
7th International Conference on Data Science, Technology and Applications
DATA 2018 27/07/2018
2. Slide 2, 27-Jul-18
- XMAS: Cross-platform Mediation, Association and Search engine
- Knowledge Management Tool
- Automatic document tagging
- Recognition of Concepts
- Represented as N-Grams (sequences of words)
- Objective: create an index-based model for key-concept extraction
Context
• XMAS
• Knowledge Management Tool
• Automatic keyword extraction
3. Slide 3, 27-Jul-18
XMAS example
• Concepts extracted
• Automatic summarization (KW)
4. Slide 4, 27-Jul-18
- Part of Speech (NLP)
- Based on syntactical characteristics of the language and the frequency of typical constructs
- Requires the exhaustive creation of word n-gram combinations (more than linear) and frequency filtering
- POS limitations
- Language dependent
- Manually laborious to design the acceptable patterns
- Including longer n-grams significantly reduces precision (even if it increases coverage)
POS solution
• POS limitations
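The pattern-based filtering described above can be sketched as follows. The accepted tag pattern and the pre-tagged sentence are illustrative assumptions, not the authors' actual rules; a real pipeline would obtain the tags from a POS tagger.

```python
import re

# Hypothetical acceptable POS pattern for concept n-grams
# (illustrative only; the patterns actually used are not shown in the slides).
# JJ = adjective, NN/NNP = (proper) noun.
ACCEPTED = re.compile(r"^(JJ )*(NNP? )*NNP?$")

def is_candidate(tagged_ngram):
    """Accept an n-gram if its tag sequence matches the allowed pattern."""
    tags = " ".join(tag for _, tag in tagged_ngram)
    return ACCEPTED.match(tags) is not None

def ngrams(tagged_tokens, n):
    """All contiguous n-grams of a pre-tagged sentence."""
    return [tagged_tokens[i:i + n] for i in range(len(tagged_tokens) - n + 1)]

# Pre-tagged example sentence (tags assumed, not produced by a real tagger).
sentence = [("the", "DT"), ("Rutgers", "NNP"), ("Preparatory", "NNP"),
            ("School", "NNP"), ("was", "VBD"), ("founded", "VBN")]

candidates = [" ".join(w for w, _ in g)
              for n in (2, 3)
              for g in ngrams(sentence, n)
              if is_candidate(g)]
print(candidates)  # ['Rutgers Preparatory', 'Preparatory School', 'Rutgers Preparatory School']
```

Even this toy version shows the cost noted above: every n-gram of every length must be generated before the pattern filter can discard it.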
5. Slide 5, 27-Jul-18
Examples
• POS performances
TP + FP = positive (accepted); TN + FN = negative (rejected)
Pos/Neg = assigned class; True/False = correctness of the match
Concepts
- Positive / True (correctly accepted):
  - Rutgers Preparatory School
  - Watts 103rd Street Rhythm Band
  - Twinkle Twinkle Little Star
  - Accademia di Belle Arti di Roma
- Positive / False (wrongly accepted):
  - Oricon Weekly Albums Chart
  - Grand Forks-ND-MN Metropolitan Statistical
  - Apple CEO Steve Jobs
- Negative / True (correctly rejected):
  - the rims of the
  - in consonance with the
  - in which they were written
  - was interred in Spring Grove Cemetery
- Negative / False (wrongly rejected):
  - Los Angeles Film Critics Association Awards
  - United States Citizenship and Immigration Services
  - State of North Carolina
  - 1917 October Revolution
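The legend above maps directly onto the standard evaluation metrics. A minimal sketch, using made-up counts rather than the paper's numbers:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard metrics from confusion-matrix counts.
    TP + FP = predicted positive; TP + FN = actual concepts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts for illustration only.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(p, r, f1)  # 0.8, ~0.667, ~0.727
```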
7. Slide 7, 27-Jul-18
- Capable of automatically identifying:
- Regularities in the data
- The meaning of particular constructs
- Possibility of adding non-linearity by means of ReLU activation units
- A deep model allows an extremely compact network to capture very complex problems
- Can use any encoding of the data
- We relied on the Word2Vec-plus by Google
Neural Network motivation
• Automatic knowledge extraction
• Multiple hidden layers
• Compatible with every available data encoding
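As a rough illustration of how such a network consumes word vectors, here is a minimal NumPy forward pass of a 1-D convolution over an n-gram's embedding matrix, with ReLU and max-over-time pooling. The dimensions and random weights are assumptions for the sketch, not the architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: a 4-gram of 50-d word vectors, 8 filters of width 2.
n, d, num_filters, width = 4, 50, 8, 2
ngram = rng.normal(size=(n, d))             # stand-in for Word2Vec vectors
filters = rng.normal(size=(num_filters, width, d))
bias = np.zeros(num_filters)

def conv_relu_maxpool(x):
    """Slide each filter over the word windows, apply ReLU, max-pool over time."""
    windows = np.stack([x[i:i + width] for i in range(len(x) - width + 1)])
    feats = np.einsum('twd,fwd->tf', windows, filters) + bias   # (time, filters)
    return np.maximum(feats, 0).max(axis=0)                     # (filters,)

features = conv_relu_maxpool(ngram)
# A final sigmoid unit would turn the pooled features into a concept score.
score = 1 / (1 + np.exp(-(features @ rng.normal(size=num_filters))))
print(features.shape, float(score))
```

Max-over-time pooling is what lets one network handle n-grams of different lengths: however many word windows the convolution produces, each filter contributes exactly one pooled feature.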
9. Slide 9, 27-Jul-18
- Use the Word2Vec-plus
- Holds the word vector, including some contextual information
- Can provide a representation for unseen words:
  a) Computation based on the 4 surrounding words
  b) Vector update
Word Embedding
• Vector representation of a word
• Also holds some context
• Can also represent unseen words
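Point (a), deriving a vector for an unseen word from its 4 surrounding words, might look like the following averaging sketch. The exact computation and update rule of Word2Vec-plus are not detailed in the slides, so this is only an assumption; the toy vocabulary and vectors are made up.

```python
import numpy as np

# Toy embedding table (random 5-d vectors instead of real Word2Vec output).
rng = np.random.default_rng(42)
vocab = {w: rng.normal(size=5) for w in
         ["the", "university", "of", "applied", "sciences"]}

def vector_for_unseen(context_words, embeddings):
    """Approximate an out-of-vocabulary word as the mean of the vectors
    of its surrounding (known) context words."""
    known = [embeddings[w] for w in context_words if w in embeddings]
    return np.mean(known, axis=0)

# "HSLU" is unseen; use the 4 surrounding words as context.
v = vector_for_unseen(["the", "university", "of", "applied"], vocab)
print(v.shape)  # (5,)
```

Step (b), the vector update, would then refine this initial estimate as more contexts of the new word are observed.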
15. Slide 15, 27-Jul-18
- The length of the N-Gram influences the results
- The percentage of valid concepts differs per class:
Data Set distribution
• Dataset characterization
19. Slide 19, 27-Jul-18
Examples
True False
Positive
American Educational Research Journal
Tianjin Medical University
carry out
Bono and The Edge
Sons of the San Joaquin
Glastonbury Lake Village
Earl of Darnley
Regiment Hussars
University of Theoretical Science
Inland Aircraft Fuel Depot
NHL and
Mexican State Senate
University of
Ireland Station
In process
Negative to the start of World War II
must complete their
just a small part
a citizen of Afghanistan who
itself include
NFL and the
a Sky
therefore it is
use by
in conversation with
Council of the Isles of Scilly
Xiahou Dun
The Tenant of Wildfell Hall
20. Slide 20, 27-Jul-18
Cross checking POS vs. CNN
Concepts
True (CNN) False (CNN)
Positive
True
Rutgers Preparatory School
Watts 103rd Street Rhythm Band
Twinkle Twinkle Little Star
Accademia di Belle Arti di Roma
Capitanes de Arecibo
Fort Belknap Indian Reservation
False
Republican President Richard
Senator Ted
East Stroudsburg Senior High School North
Charles Bender High School
The New York Times Guide
Zombie Movie Encyclopedia
Negative
True
Toronto was the
in which they were written
are a family of passerine birds which
the Art Center College of Design
the NWA World Middleweight Championship
language novel
False
Legislative Council of New South Wales
1917 October Revolution
EAFF East Asian Cup
West Surrey College of Art and Design
Federal University of Rio Grande do Sul
Los Angeles Film Critics Association Awards
United States Citizenship and Immigration Services
State of North Carolina
1917 October Revolution
23. Slide 23, 27-Jul-18
Performances w.r.t. the N-Gram length
• Dependency on the length (n)
• Global comparison metric: AUC (Area under Curve)
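The AUC used here as the global comparison metric can be computed directly from ranked scores. A small self-contained sketch of the rank-based (Mann-Whitney) formulation, with made-up classifier scores rather than the paper's data:

```python
def auc(labels, scores):
    """Probability that a randomly chosen positive example is scored
    above a randomly chosen negative one (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores for illustration: 1 = valid concept, 0 = invalid.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(labels, scores))  # 0.888...
```

Unlike F1, this value does not depend on a fixed decision threshold, which makes it convenient for comparing the POS and CNN rankings across n-gram lengths.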
24. Slide 24, 27-Jul-18
- We presented a CNN approach for automatic concept extraction
- We demonstrated its competitiveness w.r.t. POS, holding a slightly better F1 measure
- Recall increases at the cost of precision as the N-Gram length grows
- Possible next steps:
- Adopt other word-embedding models
- Use different n-gram sources, extracting them from real-world documents
- Use a different architecture (RNN, e.g. LSTM) to try to capture latent and long-running relationships
- Train individual instances for different n and then aggregate the results
Conclusions
• Results achieved
• Remaining limits
• Next research possibilities
25. Slide 25, 27-Jul-18
Dr. Luca Mazzola
Research Associate
+41 41 757 68 90
luca.mazzola@hslu.ch
Rotkreuz
Questions