Discover a new semantic tool to solve the most wicked text categorization problems.
MeaningCloud webinar, June 19, 2019.
More info and webinar contents https://www.meaningcloud.com/blog/recorded-webinar-solve-wicked-text-categorization-problems
MeaningCloud https://www.meaningcloud.com
Solve the most wicked text categorization problems - MeaningCloud webinar
1. Solve the most wicked text categorization problems
Webinar, June 19, 2019
MEANINGCLOUD – 2019
2. Before we get started…
Presenter: Antonio Matarranz, CMO
How to participate
• Send questions using the chat feature, or
• Click the “Raise your hand” button to speak and we will enable your mic
• Afterwards, you’ll be able to access a recording of the webinar and its contents as tutorials on our blog
3. Why this webinar?
In the real world, there are wicked text categorization problems
A new approach based on semantic analysis can solve them
4. Agenda
• Developing categorization models in the real world
• Categorization based on pure machine learning
• Deep Categorization API. Pre-defined models and vertical packs
• The new Deep Categorization Customization Tool. Semantic rule language
• Case Study: development of a categorization model
• Deep Categorization - Text Classification. When to use one or the other
• Agile model development process. Combination with machine learning
• Conclusions and Q&A
5. Text categorization in a perfect world
1) Training: use machine learning to train a Model using tagged corpora
   1) Collect a corpus of tagged texts (humans tagging texts)
   2) Represent each text by a feature vector that models structure and semantics
   3) Train a classifier using any suitable supervised learning algorithm (SVM, Naïve Bayes, kNN, Deep Learning…)
2) Execution: categorize input text using the Model
[Diagram: Training texts → Model Training → Machine-Learning Categorization Model; Input text → Model → Categories]
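The two phases above (training, then execution) can be sketched with a tiny multinomial Naïve Bayes classifier written with the Python standard library only. This is an illustrative sketch, not MeaningCloud's implementation: the corpus, the category names and the whitespace tokenizer are all assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace: a deliberately crude feature extractor."""
    return text.lower().split()

def train(tagged_texts):
    """Phase 1: train a multinomial Naive Bayes model from (text, category) pairs."""
    doc_counts = Counter()               # number of documents per category
    word_counts = defaultdict(Counter)   # word frequencies per category
    vocabulary = set()
    for text, category in tagged_texts:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary

def categorize(model, text):
    """Phase 2: return the most probable category (log-space, add-one smoothing)."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical tagged corpus (the "humans tagging texts" step).
corpus = [
    ("the api returns an error", "bug"),
    ("error when calling the api", "bug"),
    ("how much does the plan cost", "pricing"),
    ("price of the premium plan", "pricing"),
]
model = train(corpus)
print(categorize(model, "i get an error from the api"))
print(categorize(model, "what is the price of the plan"))
```

Note that the model can only be as good as the tagged corpus: without those tagged texts there is nothing to train on.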
6. Advantages (and limitations) of machine learning
Advantages:
• Building models is easy and fast (provided that we have a sufficient training set)
• Easy adaptation to new domains
Limitations:
• Requires enough training data to be available
• “Black box” model where adding new knowledge is hard/impossible
• High “inertia”
• Does not justify categorization results
7. Does it look familiar?
“This is our new taxonomy, but it can still be improved.”
“Training text? We do not have tagged texts.”
“It is important to differentiate Washington (the city) from Washington (the sports team), from Washington (the surname).”
“You have to change the names of all our plans and promotions for tomorrow.”
8. The real world is very difficult
WICKED PROBLEMS
• Categories are not defined or they are evolving
• We do not have an adequate training corpus
• Great precision is required to discriminate among categories
• Context in general is very dynamic
HUGE DEVELOPMENT, EXPLOITATION AND EVOLUTION COSTS
9. We need a different way of doing things
Agile Text Analytics
• Rapid Model Generation
• Incorporated Domain Knowledge
• Powerful Configuration and Refinement
• Quality Assurance
An inherently iterative and incremental process of continuous improvement
12. The foundation of our solution: Deep Categorization API
Our API for wicked categorization problems
Based on the meaning of the text
➢ Leverages the deep morphosyntactic and semantic analysis that MeaningCloud performs
[Diagram: Input text → Deep Categorization Model → Categories]
13. Deep Categorization predefined models
• IAB 2.0: Web content
• Voice of the Customer (*): Customer feedback
• Voice of the Employee (*): Employee feedback
• Intention Analysis (*): Stage in customer journey
(*) Included in MeaningCloud’s Vertical Packs
14. Now totally customizable
[Diagram: Domain knowledge (+ training text) → Customization Tool → Deep Categorization Model; Input text → Deep Categorization Model → Categories]
15. Categorization based on the meaning of the text
Use (generally) human-defined rules based on advanced pattern matching
1. Divide text into words
2. Normalization (stemming/lemmatization, case conversion, etc.)
3. Morphosyntactic and semantic analysis
4. Check and apply rules for detecting categories
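The four steps above can be sketched as a small pipeline. This is a toy illustration only: the lemma table, the single rule and the category name are hypothetical, and the analysis step is a placeholder for the real morphosyntactic and semantic analysis.

```python
import re

# Hypothetical lemma table standing in for a real morphological analyzer.
LEMMAS = {"bought": "buy", "buying": "buy", "buys": "buy", "iphones": "iphone"}

# Hypothetical rules: a category fires when all its required lemmas are present.
RULES = {"PurchaseMention": {"buy", "iphone"}}

def split_words(text):
    """Step 1: divide the text into words."""
    return re.findall(r"[A-Za-z']+", text)

def normalize(words):
    """Step 2: case conversion plus dictionary-based lemmatization."""
    return [LEMMAS.get(w.lower(), w.lower()) for w in words]

def analyze(lemmas):
    """Step 3 (placeholder): a real system would add syntax and semantics here."""
    return set(lemmas)

def apply_rules(features):
    """Step 4: return every category whose required lemmas all appear."""
    return sorted(c for c, required in RULES.items() if required <= features)

def categorize(text):
    return apply_rules(analyze(normalize(split_words(text))))

print(categorize("I bought an iPhone"))
print(categorize("I am going to sell my car"))
```

A bag-of-lemmas rule like this one cannot tell a purchase from an intention or a negation; that distinction needs the deeper analysis that step 3 only stubs out here.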
16. A difficult endeavour…
I'm going to buy an iPhone
I bought an iPhone
I will never buy an iPhone
Washington? Which Washington?
17. Semantic rule language
<Rules> -> #Category
• Modularity and Reuse
• Operators and Expressions
• Use of Semantic Information
• Abstraction
18. Rule language highlights (1)
• Literals, regular expressions and (multiword) phrases
• Logical (AND, OR, AND NOT) and proximity (NEAR) operators
• Lemmatization and grammatical function vs. exact word forms
L@produce vs. produces
[new L@product|L@service@N|L@process@N|L@value@N]~4 -> #Management>Innovation
• Macros to group words/semantic expressions and reuse them in different rules
MACRO {pet} = dog|cat|rabbit|turtle
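The idea behind macros can be sketched in Python as plain text substitution before a pattern is compiled. This is an assumption about how such expansion might work, not the rule engine's actual mechanism; `expand_macros` and `compile_rule` are hypothetical helper names.

```python
import re

# Macro definition taken from the slide: {pet} = dog|cat|rabbit|turtle
MACROS = {"pet": "dog|cat|rabbit|turtle"}

def expand_macros(pattern, macros):
    """Replace each {name} with a non-capturing group of its alternatives."""
    def substitute(match):
        name = match.group(1)
        return "(?:" + macros[name] + ")"
    return re.sub(r"\{(\w+)\}", substitute, pattern)

def compile_rule(pattern, macros):
    """Compile the expanded pattern as a whole-word, case-insensitive regex."""
    return re.compile(r"\b" + expand_macros(pattern, macros) + r"\b", re.IGNORECASE)

rule = compile_rule(r"my {pet}", MACROS)
print(expand_macros(r"my {pet}", MACROS))
print(bool(rule.search("I love my rabbit")))
print(bool(rule.search("I love my car")))
```

Because every rule refers to `{pet}` by name, adding a new animal to the macro updates all the rules that use it.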
19. Rule language highlights (2)
• Use of detected entities and concepts and their semantic types
S@Top>Organization>Company>FinancialCompany>BankingCompany@instance AND NOT Bank_of_America -> #BankAmericaCompetitors
S@Top>LivingThing>Animal::{pet} -> #NonPetAnimal
• Geographical information
{travel} AND G@America>Canada -> #Travel>Canada
• Use of categories in rules (whether or not the text is classified in a category can be used in the rules)
#SpeedAgility AND #Channel>App -> #SpeedAgilityWithApp
• Robustness to spelling mistakes (Bank of Amerca)
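Tolerance to spelling mistakes such as “Bank of Amerca” can be approximated with fuzzy string matching. The sketch below uses `difflib` from the Python standard library; the entity list and the 0.85 cutoff are assumptions chosen for the example, and the actual product may handle misspellings quite differently.

```python
import difflib

# Hypothetical dictionary of known entity names.
KNOWN_ENTITIES = ["Bank of America", "Wells Fargo", "Citibank"]

def resolve_entity(mention, cutoff=0.85):
    """Return the closest known entity, or None if nothing is similar enough."""
    lowered = {name.lower(): name for name in KNOWN_ENTITIES}
    matches = difflib.get_close_matches(
        mention.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None

print(resolve_entity("Bank of Amerca"))
print(resolve_entity("Bank of Spain"))
```

The cutoff trades recall for precision: a lower value catches more typos but starts confusing distinct names.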
23. Process
1. Write rules based on a basic knowledge of the categories
2. Use advanced features to boost recall and precision
3. Apply iterative and incremental development to refine and adapt to dynamic scenarios
24. A simple case
Category: Bug Report – Web
• Rule: Validation email
I didn’t receive the validation mail
I’m still waiting for the confirmation email
I’m waiting on confirmation that you have received my e-mail
receive|wait AND "validation|confirmation e-?mail|mail"
Lemma: “I didn’t receive”, “I’m waiting”…
Literal multiword expression: “validation mail”, “confirmation email”…
Regular expression: “mail”, “email”, “e-mail”
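A rough Python equivalent of the rule above combines a lemma check with a phrase pattern. It is only a sketch: the lemma table is hypothetical, and the phrase is read strictly as two adjacent words, which may be narrower than what the rule language actually matches.

```python
import re

# Hypothetical lemma table: maps inflected forms to the lemmas used in the rule.
LEMMAS = {
    "receive": "receive", "received": "receive", "receiving": "receive",
    "wait": "wait", "waiting": "wait", "waited": "wait",
}

# Multiword phrase, with a regular expression covering mail / email / e-mail.
PHRASE = re.compile(
    r"\b(?:validation|confirmation) (?:e-?mail|mail)\b", re.IGNORECASE
)

def has_lemma(text, wanted):
    """True if any word of the text has one of the wanted lemmas."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(LEMMAS.get(word) in wanted for word in words)

def validation_email_rule(text):
    # Sketch of: receive|wait AND "validation|confirmation e-?mail|mail"
    return has_lemma(text, {"receive", "wait"}) and bool(PHRASE.search(text))

print(validation_email_rule("I didn’t receive the validation mail"))
print(validation_email_rule("I’m still waiting for the confirmation email"))
print(validation_email_rule("Thanks for the confirmation email"))
```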
25. Including semantic information (1)
Category: Bug Report – APIs
• Rule: API error
I’m having issues with the sentiment API
<MeaningCloud API mention> AND error|bug|issue|problem
Category: Bug Report - Integrations
• Rule: Integration error
I am trying to install the VoE plugin but keep receiving the error below
<MeaningCloud Integration mention> AND error|bug|issue|problem
26. Including semantic information (2)
Creation of a custom dictionary
• Entities and concepts, with their semantic information
• Use them in rules
Top > Product > API: Topics Extraction, Text Classification, Sentiment Analysis, Deep Categorization, Summarization, …
Top > Product > Integration: Excel add-in, GATE plug-in, Google Sheets add-on, RapidMiner extension, Zapier app, …
27. Including semantic information (3)
S@Top>Product>API AND error|bug|issue|problem (any mention of an API product)
S@Top>Product>Integration AND error|bug|issue|problem (any mention of an Integration product)
28. Modularity and reuse applying macros
E.g.: error|bug|issue|problem appears in multiple contexts and rules
{error} = error|issue|problem|bug
{agent} = representative|agent|someone|engineer
S@Top>Product>API AND {error}
S@Top>Product>Integration AND {error}
Modular reuse
29. Using categories within rules
• Conflicts between categories
• Rules that depend on certain categories having been triggered
Hi, I’ve received an error message when using the sentiment analysis tool for Excel that says “you don’t have access to this sm/model yet”
Bug Report – APIs or Bug Report - Integrations
#BR-INT AND #BR-API -> #BR-API
If both categories are triggered, exclude Bug Report – APIs
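The conflict resolution described on this slide can be sketched as a post-processing step over the set of triggered categories. The function below follows the slide's verbal description (when both fire, drop Bug Report – APIs); it is an assumption about the intended behavior, not the semantics of the rule language itself, and `resolve_conflicts` is a hypothetical name.

```python
def resolve_conflicts(categories):
    """Drop BR-API when BR-INT also fired; otherwise leave the set unchanged."""
    resolved = set(categories)
    if {"BR-INT", "BR-API"} <= resolved:
        resolved.discard("BR-API")
    return resolved

# The Excel example mentions an API (sentiment analysis) and an integration (Excel).
print(resolve_conflicts({"BR-INT", "BR-API"}))
print(resolve_conflicts({"BR-API"}))
```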
30. E.g., releasing a new API: Insight Engine
Including a new product without modifying rules
• Include “Insight Engine” in the dictionary
• Changes are propagated to the model without needing to modify anything
[Diagram: Verbatims → Deep Categorization API (Deep Categorization Model + Dictionary) → Categories]
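Why a dictionary change propagates without touching the rules can be illustrated with a small sketch. The dictionary contents, the substring lookup and the function names are assumptions for the example; the rule mirrors S@Top>Product>API AND error|bug|issue|problem from the earlier slide.

```python
import re

# Hypothetical custom dictionary: entity name -> semantic type path.
DICTIONARY = {
    "sentiment analysis": "Top>Product>API",
    "text classification": "Top>Product>API",
    "excel add-in": "Top>Product>Integration",
}
ERROR = re.compile(r"\b(?:error|bug|issue|problem)s?\b", re.IGNORECASE)

def semantic_types(text):
    """Return the semantic types of every dictionary entry mentioned in the text."""
    lowered = text.lower()
    return {sem_type for name, sem_type in DICTIONARY.items() if name in lowered}

def is_api_bug_report(text):
    # Sketch of: S@Top>Product>API AND error|bug|issue|problem
    return "Top>Product>API" in semantic_types(text) and bool(ERROR.search(text))

verbatim = "I keep getting an error from Insight Engine"
before = is_api_bug_report(verbatim)
DICTIONARY["insight engine"] = "Top>Product>API"  # new product, rule untouched
after = is_api_bug_report(verbatim)
print(before, after)
```

The rule refers to a semantic type rather than to product names, so only the dictionary needs to know about the new product.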
31. Advantages (and limitations) of semantic rules
Advantages:
• "White box" model, where adding new knowledge is easy
• Low "inertia"
• Errors are easy to correct
• Accuracy can be as high as desired
• Does not require tagging training corpus
• Justifies categorization results
Limitations:
• The development of models requires effort (but less than manually tagging a training set)
• Adaptation to new domains is relatively expensive
33. API Comparison: Deep Categorization vs. Text Classification. When to use one or the other?
Text Classification API
(Machine Learning + Basic Rules)
• Well defined and fixed categories
• Very big models
• Plenty of training texts are available
• Relatively static scenario
Deep Categorization API
(Semantic Rules)
• Badly defined or evolving categories
• Models that are not too extensive
• Not enough training texts are available
• High precision is required to discriminate among categories
• Dynamic scenario
• The justification of categories is a necessity
34. Agile model development process. Combination with machine learning – Option 1
[Diagram: Input text → Machine-Learning (ML) Categorization (ML Model) → Intermediate categories → Deep Categorization (Rule Model) → Categories. Other labels: Training texts, Model Training, Classifier training engine, Classifier engine, Model Editor, Rule editor, Automatic categorization engine]
• Fast model development and high precision from the beginning
• Transparency, refinement and adaptation
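One reading of this combination is a two-stage pipeline: a machine-learning stage proposes intermediate categories and a rule stage refines them. The sketch below assumes that reading; the ML stage is a keyword stub standing in for a trained classifier, and all category names and patterns are hypothetical.

```python
import re

def ml_stage(text):
    """Stand-in for a trained classifier: returns coarse intermediate categories."""
    lowered = text.lower()
    if "error" in lowered or "problem" in lowered:
        return {"BugReport"}
    return {"Other"}

# Hypothetical refinement rules: (intermediate category, pattern, final category).
RULES = [
    ("BugReport", re.compile(r"\bAPI\b", re.IGNORECASE), "BugReport>API"),
    ("BugReport", re.compile(r"\b(?:plug-?in|add-?in)\b", re.IGNORECASE),
     "BugReport>Integration"),
]

def rule_stage(text, intermediate):
    """Refine intermediate categories; keep them unchanged if no rule fires."""
    refined = {
        final
        for needed, pattern, final in RULES
        if needed in intermediate and pattern.search(text)
    }
    return refined or intermediate

def categorize(text):
    return rule_stage(text, ml_stage(text))

print(categorize("I get an error from the sentiment API"))
print(categorize("Problem installing the Excel add-in"))
print(categorize("What does the premium plan cost?"))
```

The rule stage is the part that stays transparent and editable, which is where refinement and adaptation happen in this setup.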
36. Customer case: contact center call categorization in telco
• Automatic categorization of call summaries prepared by operators to extract the reason (root cause) of the call
• Goal: increase satisfaction and reduce calls to the contact center
• Challenges:
– Highly dimensional complex model
▪ 3 levels: functional area + reason + 2nd order reason / product
▪ 56 categories in level 1; 1,615 categories in total
– High semantic overlap
– Texts with incorrect capitalization and abundant typos
– Modular categories, need to reuse definitions
– Need for evolution over time
– 10 days
• Solution:
– Abundant use of macros and "virtual" categories
– Complex rules
– Expansion of rules using Word Embeddings to discover synonyms and related terms
– Final model with 800 macros and 2,395 rules
– Recall of 80% of the texts
– Final precision: 78% in level 1, 75% exact-match
37. Customer case: categorization of emails in banking
• Automatic categorization of email messages in the contact center
• Goal: automatic routing to the area in charge
• Challenges:
– Model with 3 orthogonal dimensions (reason + product / service + satisfaction), 39 categories in total
– 3 different languages
– High semantic overlap
– Multi-label scenario (several labels allowed)
– 4 weeks
• Solution:
– One model per language
– Use of product / service dictionaries
– Abundant use of macros
– Rules with weights for relevance calculation
– Model with 590 - 733 rules, depending on language
– Final precision: 70% reason, 75% product / service, 93% satisfaction
40. Stay tuned to our blog and emails
We’ll be posting a recording of the webinar and its contents as tutorials soon
41. Thank you for your attention!
Automating the extraction of Meaning from any information source.
www.meaningcloud.com
+1 (646) 403-3104
3537 36th Street
New York, NY 11106
amatarranz@meaningcloud.com