SophiaConf 2018 - J. Rahajarison (My Little Adventure)
1. Smart recommendation engine of things to do in destination
Natural Language Processing and Machine Learning
How to automatically categorize tours and activities?
July 2nd 2018
3. Agenda
Introduction to machine learning
Why is Natural Language Processing so hard?
How do we process text?
Let’s try it out
Go further
4. What’s Machine Learning?
Software that does something without being explicitly programmed to, just by learning from examples
The same software can be used for various tasks
It learns from experience with respect to some task and performance measure, and its performance improves with experience
8. Obviously, you said text
Not numbers
Context
Polysemy
Synonyms
Enantiosemy
Neologisms
Sarcasm
Names
Rare words
Common sense
Dialects
Informal language / abbreviations
13. A bag of words
"John","likes","to","watch","movies","Mary","likes","movies","too"
{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
{131:1, 132:2, 133:1, 134:1, 135:2, 136:1, 137:1}
[1, 2, 1, 1, 2, 1, 1]
Each unique word in our dictionary will correspond to a feature
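The four lines above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the names `tokens`, `vocabulary` and `vector` are illustrative, and the feature ids start at 0 here rather than at 131 as on the slide.

```python
from collections import Counter

tokens = ["John", "likes", "to", "watch", "movies", "Mary", "likes", "movies", "too"]

# Count how many times each word occurs (order follows first appearance).
counts = Counter(tokens)

# Give every unique word an integer feature id.
# The slide uses ids 131..137; starting at 0 is an arbitrary choice made here.
vocabulary = {word: index for index, word in enumerate(counts)}

# The document becomes a vector with one count per vocabulary entry.
vector = [0] * len(vocabulary)
for word, count in counts.items():
    vector[vocabulary[word]] = count

print(dict(counts))
print(vector)  # [1, 2, 1, 1, 2, 1, 1]
```

Word order is lost in this representation: only the counts survive, which is why it is called a "bag" of words.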
14. TF-IDF
TF (Term Frequency) = occurrences of a term / total words in the document
IDF (Inverse Document Frequency) = log(count of documents / count of documents where the term appears)
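As a rough illustration of the two ratios, here is a sketch in plain Python. The three documents are activity titles taken from the "Let's measure" slide, lowercased and split on spaces; the function names are hypothetical, and returning 0.0 for a term that appears in no document is a choice made for this sketch only.

```python
import math

# Three activity titles from the deck, lowercased and split on whitespace.
documents = [
    "dinner cruise with champagne".split(),
    "gourmet tour of paris".split(),
    "cooking class in paris".split(),
]

def term_frequency(term, document):
    # Occurrences of the term divided by the total number of words in the document.
    return document.count(term) / len(document)

def inverse_document_frequency(term, corpus):
    # log(count of documents / count of documents where the term appears).
    containing = sum(1 for document in corpus if term in document)
    if containing == 0:
        return 0.0  # assumption of this sketch: unseen terms get no weight
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

# "gourmet" appears in one title, "paris" in two, so "gourmet" weighs more.
print(tf_idf("gourmet", documents[1], documents))
print(tf_idf("paris", documents[1], documents))
```

A word that shows up in every document gets an IDF of log(1) = 0, which is how TF-IDF discounts words that carry little information.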
15. Another way: use word embeddings
Word embeddings capture relative meaning
Use vectors to get a comprehensive geometry of words
16. Another way: use word embeddings
Paris - France + China = Beijing
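The arithmetic above can be sketched with toy vectors. The 3-dimensional values below are invented by hand purely for illustration (real embeddings are learned from large corpora and have far more dimensions), "Berlin" is an added distractor, and `analogy` is a hypothetical helper name.

```python
import math

# Hand-made vectors, NOT real embeddings: values chosen only to make the point.
embeddings = {
    "Paris":   [0.9, 0.8, 0.1],
    "France":  [0.1, 0.9, 0.1],
    "China":   [0.1, 0.1, 0.9],
    "Beijing": [0.9, 0.1, 0.8],
    "Berlin":  [0.9, 0.3, 0.2],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(start, minus, plus):
    # Compute start - minus + plus, then return the closest remaining word.
    target = [s - m + p for s, m, p in
              zip(embeddings[start], embeddings[minus], embeddings[plus])]
    candidates = [word for word in embeddings if word not in (start, minus, plus)]
    return max(candidates, key=lambda word: cosine(embeddings[word], target))

print(analogy("Paris", "France", "China"))  # Beijing
```

The input words are excluded from the candidates, which is the usual convention for this kind of query.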
20. Recipe
Prepare training / test data: files, database, cache, data flow
Select a model and its (hyper)parameters
Train the algorithm
Use or store your trained estimator
Make predictions
Measure accuracy and precision
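The recipe can be walked through end to end with a very small example. This is a sketch under several assumptions: the labelled titles are invented for illustration (1 = food-related, 0 = not), the model is a hand-written multinomial Naive Bayes rather than whatever the talk actually used, `pickle` stands in for storing the trained estimator, and `fit` / `predict` are hypothetical names.

```python
import math
import pickle
from collections import Counter

# Step 1: prepare training / test data (toy examples, invented for this sketch).
train = [
    ("dinner cruise with champagne", 1),
    ("gourmet food tour", 1),
    ("cooking class with lunch", 1),
    ("wine and cheese tasting", 1),
    ("segway tour of the city", 0),
    ("museum dedicated entrance", 0),
    ("guided walking tour", 0),
    ("river cruise by night", 0),
]
test = [
    ("dinner show with champagne", 1),
    ("gourmet lunch cruise", 1),
    ("museum guided tour", 0),
    ("segway city tour", 0),
]

# Steps 2 and 3: pick a model (Naive Bayes) and train it by counting words per label.
def fit(examples):
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return {"words": word_counts, "labels": label_counts, "vocab": vocabulary}

def predict(model, text):
    # Pick the label with the highest log-probability, using add-one smoothing.
    total = sum(model["labels"].values())
    scores = {}
    for label, count in model["labels"].items():
        counts = model["words"][label]
        denominator = sum(counts.values()) + len(model["vocab"])
        score = math.log(count / total)
        for word in text.split():
            score += math.log((counts[word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)

# Step 4: store and reload the trained model (in memory here, a file in practice).
restored = pickle.loads(pickle.dumps(model))

# Step 5: make predictions on data the model has not seen.
predictions = [predict(restored, text) for text, _ in test]
labels = [label for _, label in test]

# Step 6: measure accuracy and precision.
correct = sum(p == y for p, y in zip(predictions, labels))
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
predicted_positives = sum(p == 1 for p in predictions)
accuracy = correct / len(labels)
precision = true_positives / predicted_positives if predicted_positives else 0.0
print(accuracy, precision)
```

With so few examples the scores say nothing about real-world quality; the point is only the order of the steps.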
23. A few recommendations
Naive Bayes / Logistic Regression
Decision Trees
Random Forest
Gradient Boosting
SVM
Neural Networks
24. Let’s measure
Activity                             Food label prediction
Eiffel Tower with Dinner             0.83
Gourmet tour of Paris                0.96
Dinner cruise with Champagne         1.0
Segway tour of city’s highlights     0.03
Orsay dedicated entrance             0.02
3 course meal in Eiffel Tower        0.97
Cooking class in Paris               0.89
Moulin Rouge Paris dinner show       0.91
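One way to read these scores is to turn them into a yes/no label with a cut-off. The sketch below copies the scores from the slide; the 0.5 threshold is an assumption made here, not something stated in the talk.

```python
# Scores copied from the slide above.
scores = {
    "Eiffel Tower with Dinner": 0.83,
    "Gourmet tour of Paris": 0.96,
    "Dinner cruise with Champagne": 1.0,
    "Segway tour of city's highlights": 0.03,
    "Orsay dedicated entrance": 0.02,
    "3 course meal in Eiffel Tower": 0.97,
    "Cooking class in Paris": 0.89,
    "Moulin Rouge Paris dinner show": 0.91,
}

THRESHOLD = 0.5  # assumed cut-off, to be tuned on real data

food_related = [title for title, score in scores.items() if score >= THRESHOLD]
not_food = [title for title, score in scores.items() if score < THRESHOLD]

print(len(food_related), "food-related activities")
print(not_food)
```

Raising the threshold trades recall for precision: fewer activities get the Food label, but those that do are more likely to deserve it.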
Training set
Real data
27. There is way more
Cross validation dataset
N-Grams
Wrong user content
Misspellings & typos
Hard to get training data
Harder languages or transliteration issues
Memory / computing limitations
Online learning & Stacking
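Of the items above, N-grams are the quickest to illustrate: instead of single words, each feature is a run of n consecutive words, which keeps a little of the word order that a bag of words throws away. A minimal sketch, with `ngrams` as a hypothetical helper name:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "dinner cruise with champagne".split()
print(ngrams(tokens, 2))
# [('dinner', 'cruise'), ('cruise', 'with'), ('with', 'champagne')]
```

Each tuple can then be counted like a single word in the bag-of-words or TF-IDF representations.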