SophiaConf 2018 - J. Rahajarison (My Little Adventure)

TelecomValley
Smart recommendation engine
of things to do in destination
Natural Language Processing and
Machine Learning
How to automatically categorize tours
and activities?
July 2nd 2018
Introduction
MyLittleAdventure
@mylitadventure
Johnny RAHAJARISON
@brainstorm_me
johnny.rahajarison@mylittleadventure.com
Agenda
Introduction to machine learning
Why is Natural Language Processing so hard?
How do we process text?
Let’s try it out
Go further
What is machine learning?
Software that does something without being
explicitly programmed to, just by learning
from examples
The same software can be used for various tasks
It learns from experience with respect to some task and
performance measure, and improves with experience
Unsupervised algorithms
Clustering
Anomaly detection
Supervised algorithms
Classification
Regression
You said text, right?
Obviously, you said text
Not numbers
Context
Polysemy
Synonyms
Enantiosemy
Neologisms
Sarcasm
Names
Rare words
Common sense
Dialects
Informal language / abbreviations
Ambiguity?
I saw a man on a hill with a telescope.
Who has the telescope: me, the man, or the hill?
Text should be prepared
Let’s clean our text first
['one', 'morn', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubl', 'dream', 'he', 'found',
'himself', 'transform', 'in', 'hi', 'bed', 'into', 'a', 'horribl', 'vermin', 'He', 'lay', 'on',
'hi', 'armour-lik', 'back', 'and', 'if', 'he', 'lift', 'hi', 'head', 'a', 'littl', 'he', 'could',
'see', 'hi', 'brown', 'belli', 'slightli', 'dome', 'and', 'divid', 'by', 'arch', 'into', 'stiff',
'section', 'the', 'bed', 'wa', 'hardli', 'abl', 'to', 'cover', 'it', 'and', 'seem', 'readi', 'to',
'slide', 'off', 'ani', 'moment', 'hi', 'mani', 'leg', 'piti', 'thin', 'compar', 'with', 'the',
'size', 'of', 'the', 'rest', 'of', 'him', 'wave', 'about', 'helplessli', 'as', 'he', 'look',
'what', "'s", 'happen', 'to']
✓ Tokenize sentences
✓ Tokenize words
✓ Transliterate
✓ Normalize
✓ Filter out (punctuation, special characters, stop words)
✓ Use a stemmer and / or a lemmatizer ("be" = am, are, is; "vari" = variation, vary, varies, variables)
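The cleaning steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the pipeline used in the talk: the stop-word list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as the ones shipped with NLTK.

```python
import re
import unicodedata

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "with"}


def transliterate(text):
    """Replace accented characters with their closest ASCII equivalent."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer, not Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Tokenize sentences and words, transliterate, normalize, filter and stem."""
    tokens = []
    # Naive sentence tokenization: split after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        normalized = transliterate(sentence).lower()
        # Word tokenization; this also drops punctuation and special characters.
        for word in re.findall(r"[a-z0-9]+", normalized):
            if word not in STOP_WORDS:
                tokens.append(naive_stem(word))
    return tokens


print(clean("Visiting the Musée d'Orsay. Guided tours of the galleries!"))
# ['visit', 'musee', 'd', 'orsay', 'guid', 'tour', 'galleri']
```

A lemmatizer would instead map words to dictionary forms ("galleries" to "gallery"), which needs a vocabulary and is out of scope for this sketch.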
A bag of words
"John","likes","to","watch","movies","Mary","likes","movies","too"
{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
{131:1, 132:2, 133:1, 134:1, 135:2, 136:1, 137:1}
[1, 2, 1, 1, 2, 1, 1]
Each unique word in our dictionary will correspond to a feature
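The bag-of-words transformation can be reproduced in a few lines. This sketch assumes ids are assigned in order of first appearance starting at 0; the ids 131 to 137 on the slide presumably come from a larger dictionary.

```python
from collections import Counter

tokens = ["John", "likes", "to", "watch", "movies", "Mary", "likes", "movies", "too"]

# Count how many times each word occurs in the document.
counts = Counter(tokens)

# Build the dictionary: every unique word gets an integer id (one feature each).
vocabulary = {}
for word in tokens:
    vocabulary.setdefault(word, len(vocabulary))

# The document becomes a vector holding one count per dictionary entry.
vector = [0] * len(vocabulary)
for word, count in counts.items():
    vector[vocabulary[word]] = count

print(vocabulary)
print(vector)  # [1, 2, 1, 1, 2, 1, 1]
```

Word order is lost in this representation, which is why it is called a "bag" of words.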
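TF-IDF
TF (term frequency):
\[ \mathrm{TF} = \frac{\text{occurrences of a term in a document}}{\text{total words in that document}} \]
IDF (inverse document frequency):
\[ \mathrm{IDF} = \log\left( \frac{\text{count of documents}}{\text{count of documents where the term appears}} \right) \]
\[ \mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF} \]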
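A direct translation of the TF and IDF definitions into Python might look as follows. The three toy documents are assumptions built from activity titles; the function names are illustrative.

```python
import math

# Toy corpus: each document is a list of already-cleaned tokens.
documents = [
    ["eiffel", "tower", "with", "dinner"],
    ["skip", "the", "line", "eiffel", "tower"],
    ["dinner", "cruise", "with", "champagne"],
]


def tf(term, document):
    """Occurrences of the term divided by the total words in the document."""
    return document.count(term) / len(document)


def idf(term, documents):
    """Log of (count of documents / count of documents containing the term).

    Assumes the term appears in at least one document.
    """
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)


def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)


# "champagne" is rarer across the corpus than "dinner", so it scores higher.
print(tf_idf("champagne", documents[2], documents))
print(tf_idf("dinner", documents[2], documents))
```

Libraries such as scikit-learn apply a smoothed and normalized variant of this formula by default, so their numbers will differ from this plain version.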
Another way: use word embeddings
Word embeddings capture relative meaning
Use vectors to get a meaningful geometry of words
Paris - France + China = Beijing
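The analogy arithmetic can be illustrated with cosine similarity. The 3-dimensional vectors below are hand-made for the example (roughly "France-ness", "China-ness", "capital-ness"); real embeddings such as GloVe have hundreds of learned dimensions and only approximate this behaviour.

```python
import math

# Hypothetical toy embeddings, chosen by hand for illustration only.
embeddings = {
    "paris":    [1.0, 0.0, 1.0],
    "france":   [1.0, 0.0, 0.0],
    "china":    [0.0, 1.0, 0.0],
    "beijing":  [0.0, 1.0, 1.0],
    "shanghai": [0.0, 1.0, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three input words."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [word for word in embeddings if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, embeddings[word]))


print(analogy("paris", "france", "china"))  # beijing
```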
Example of "movies" vector
movies -0.34582 0.057328 0.1328 0.22376 0.10161 0.52948 -0.30199 0.45676 -0.37643 -0.51857 0.67325 -0.012444 -0.099021 0.43823
-0.28905 -1.0183 -0.0062387 -0.32893 0.55547 0.44181 0.31524 0.29909 0.51605 0.32109 0.021471 0.67909 0.037333 -0.42321
0.56517 0.47979 -0.63307 0.1126 0.0050579 -0.18879 -0.87478 -0.29481 -0.70824 -0.072256 0.1614 0.34523 0.61872 -0.036932
-0.43343 0.29604 0.18671 -0.33384 0.50628 -0.013876 0.46303 0.19298 0.16783 -0.55786 -0.16947 -0.27382 0.31027 0.10974 0.12819
0.23538 0.038003 -0.077524 -0.23291 0.044094 0.36325 0.20611 0.55571 -0.022715 -0.04996 0.32312 0.44176 0.25272 0.15159
0.22682 -0.10425 0.73375 0.66572 -0.55885 0.082242 -0.13387 0.31042 -0.38443 -0.38631 -0.7518 0.6706 -0.17495 0.056298 0.82038
0.41573 -0.12316 0.28437 -0.19324 -0.13485 0.28862 -0.37817 0.37268 0.01515 0.39123 0.059544 -0.074006 -0.17152 -1.1523
0.26541 0.082314 0.17914 -0.089861 -0.20884 0.29248 -0.60263 -0.0024285 0.24521 -0.5427 -0.074404 0.14034 0.0085891 -0.37351
0.23573 0.1493 -0.14038 0.11725 -0.51013 -0.64531 0.1329 0.075911 -0.10827 0.22077 -0.086253 0.4096 0.052314 0.40964 -0.030506
0.30572 -0.40694 -0.11773 0.21586 0.14448 0.23419 -0.23401 0.06811 0.29447 -0.4086 0.88777 -0.19477 -0.18847 0.10324 -0.24593
-0.10173 -0.43226 -0.091173 -0.092602 -0.23385 -0.16498 0.22057 0.11014 -0.25018 -0.43089 0.19759 0.11762 -0.045432 0.13331
0.032684 -0.21702 0.35082 -0.40466 -0.02425 -0.22637 0.0094442 0.72848 0.10286 0.27199 -0.40396 0.22366 -0.039481 -0.17164
-1.7307 0.3706 -0.13711 0.2295 -0.34432 -0.024381 -0.093941 -0.29861 -0.33164 -0.12931 -0.11218 0.047052 0.40442 0.0043382
0.22364 -0.31537 0.1987 -0.46108 -0.35126 -0.14584 0.17765 0.10869 -0.14434 -0.6152 -0.5874 0.014977 -0.1691 -0.46926 1.3959
-0.15449 -0.24167 -0.002575 0.4758 -0.044786 -0.21345 0.22983 -0.34356 -0.43402 -0.45719 -0.29775 -0.053295 0.50132 -0.24066
0.45762 0.095118 0.21008 0.71912 0.028577 -0.64176 0.1314 0.21556 -0.12536 -0.3298 -0.07123 0.35428 -0.3787 0.12348 -0.060439
0.19217 -0.29951 -0.73189 -0.33589 0.449 0.22654 1.0404 0.019947 -0.74711 0.071042 0.067809 0.36341 -0.32579 -0.11085 -0.24507
-0.13518 -0.44326 0.022784 -0.57252 0.33756 -0.23411 -0.062955 -0.35353 1.0497 -0.14938 -0.57772 0.27652 -0.28787 -0.0040621
0.25113 0.40818 -0.13227 0.016032 -0.55465 0.0021098 -0.27755 0.16082 -0.055202 0.21104 0.58412 0.42842 -0.047253 0.10542
0.027478 0.30911 0.31792 -1.8564 0.014412 -0.29748 -0.70103 -0.068219 -0.53071 -0.10661 0.028596 0.081479 0.34323 -0.047833
0.023129 0.028697 0.33859 -0.20706 -0.0025571 -0.18267 -0.26946 -1.1064 -0.31228 -0.13101 0.1161 -0.068647 -0.09988
Another way: use word embeddings
{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
{131:1, 132:2, 133:1, 134:1, 135:2, 136:1, 137:1}
[1, 2, 1, 1, 2, 1, 1]
[[], 2*[], [], [], 2 *[-0.34582, 0.057328, … 0.22376, 0.10161], [], []]
(the non-empty entry is the embeddings vector for "movies")
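One way to read this slide: each count is multiplied by the embedding of its word. The sketch below does that with made-up 2-dimensional embeddings, then averages the result into a single document vector. The averaging step is an assumption, a common choice that the slide does not state.

```python
counts = {"John": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Mary": 1, "too": 1}

# Hypothetical 2-dimensional embeddings (real ones have hundreds of dimensions).
embeddings = {
    "John": [0.1, 0.3], "likes": [0.5, -0.2], "to": [0.0, 0.1], "watch": [0.4, 0.4],
    "movies": [0.6, 0.5], "Mary": [0.1, 0.2], "too": [0.0, 0.0],
}

# Each word contributes its embedding multiplied by its count.
weighted = [[count * x for x in embeddings[word]] for word, count in counts.items()]

# Assumed extra step: average everything into one fixed-size document vector.
total = sum(counts.values())
dims = len(next(iter(embeddings.values())))
document_vector = [sum(vec[i] for vec in weighted) / total for i in range(dims)]

print(weighted[4])       # contribution of "movies": 2 * its embedding
print(document_vector)
```

The document vector has the same size whatever the length of the text, which is what a classifier needs as input.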
Let’s predict
Recipe
Prepare: training / test data (files, database, cache, data flow)
Train algorithm: selection of model and (hyper)parameters
Make predictions: use or store your trained estimator
Measure: accuracy, precision
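The recipe can be walked through end to end with a small hand-written multinomial Naive Bayes classifier. This is a sketch under assumptions: the training and test examples and their labels are invented from activity titles, and a library such as scikit-learn would normally replace the `fit` and `predict` functions.

```python
import math
from collections import Counter

# Prepare: training / test data (toy examples; the labels are assumptions).
train = [
    ("eiffel tower with dinner", "food"),
    ("gourmet tour of paris", "food"),
    ("dinner cruise with champagne", "food"),
    ("skip the line eiffel tower", "other"),
    ("louvre museum fast track", "other"),
    ("segway tour of city highlights", "other"),
]
test = [("cooking class with dinner", "food"), ("museum fast track entrance", "other")]

# Select a model and hyperparameters: multinomial Naive Bayes, smoothing ALPHA.
ALPHA = 1.0


def fit(examples):
    """Train: count words per label and collect the vocabulary."""
    word_counts, label_counts, vocabulary = {}, Counter(), set()
    for text, label in examples:
        words = text.split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Predict: return the label with the highest log-probability."""
    word_counts, label_counts, vocabulary = model
    n_examples = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(count / n_examples)
        for word in text.split():
            if word in vocabulary:  # words never seen in training are ignored
                smoothed = word_counts[label][word] + ALPHA
                score += math.log(smoothed / (total + ALPHA * len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = fit(train)  # the trained estimator could be stored, e.g. with pickle

# Measure: accuracy on the held-out test set.
correct = sum(predict(model, text) == label for text, label in test)
print(f"accuracy: {correct / len(test):.2f}")
```

With only two test examples the accuracy figure means little; the point is the order of the steps.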
Collect our training & test dataset
Food Label Vectorized
Eiffel Tower with Dinner: [0., 0., 0., 0., 0.5, 0.5, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.5, 0., 0.5]
Skip the line Eiffel Tower: [0., 0., 0., 0., 0., 0.3967171, 0., 0., 0., 0.47792296, 0., 0., 0., 0., 0., 0.47792296, 0.47792296, 0., 0., 0.3967171, 0., 0.]
Louvre Museum fast track: [0., 0., 0., 0., 0., 0., 0.5, 0., 0., 0., 0.5, 0.5, 0., 0., 0., 0., 0., 0., 0., 0., 0.5, 0.]
Gourmet tour of Paris: [0., 0., 0., 0., 0., 0., 0., 0.58910044, 0., 0., 0., 0., 0.41798437, 0.48900396, 0., 0., 0., 0., 0.48900396, 0., 0., 0.]
Segway tour of city’s highlights: [0., 0., 0.48838773, 0., 0., 0., 0., 0., 0.48838773, 0., 0., 0., 0.3465257, 0., 0.48838773, 0., 0., 0., 0.40540376, 0., 0., 0.]
Dinner cruise with Champagne: [0., 0.54408243, 0., 0.54408243, 0.45163515, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.45163515]
Aquarium of Paris ticket: [0.55967542, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.39710644, 0.46457866, 0., 0., 0., 0.55967542, 0., 0., 0., 0.]
… …
Choose a classifier algorithm
A few recommendations
Naive Bayes / Logistic Regression
Decision Trees
Random Forest
Gradient Boosting
SVM
Neural Networks
Let’s measure
Food Label Prediction
Eiffel Tower with Dinner 0.83
Gourmet tour of Paris 0.96
Dinner cruise with Champagne 1.0
Segway tour of city’s highlights 0.03
Orsay dedicated entrance 0.02
3 course meal in Eiffel Tower 0.97
Cooking class in Paris 0.89
Moulin Rouge Paris dinner show 0.91
Training set
Real data
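Accuracy, precision and recall can be computed from prediction scores like these once a decision threshold is chosen. In the sketch below the scores are copied from the slide, but the true labels are an assumption inferred from the activity titles, since the Food column did not survive in the text.

```python
# (title, predicted score, assumed true label: True = food-related)
results = [
    ("Eiffel Tower with Dinner", 0.83, True),
    ("Gourmet tour of Paris", 0.96, True),
    ("Dinner cruise with Champagne", 1.0, True),
    ("Segway tour of city’s highlights", 0.03, False),
    ("Orsay dedicated entrance", 0.02, False),
    ("3 course meal in Eiffel Tower", 0.97, True),
    ("Cooking class in Paris", 0.89, True),
    ("Moulin Rouge Paris dinner show", 0.91, True),
]


def measure(results, threshold=0.5):
    """Return (accuracy, precision, recall) for a given decision threshold."""
    tp = sum(1 for _, score, truth in results if score >= threshold and truth)
    fp = sum(1 for _, score, truth in results if score >= threshold and not truth)
    fn = sum(1 for _, score, truth in results if score < threshold and truth)
    tn = sum(1 for _, score, truth in results if score < threshold and not truth)
    accuracy = (tp + tn) / len(results)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


print(measure(results))       # threshold 0.5
print(measure(results, 0.9))  # stricter threshold: fewer items count as food
```

Raising the threshold keeps precision high but lowers recall here, because the two food items scored 0.83 and 0.89 are no longer accepted.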
Go further
There is way more
Cross validation dataset
N-Grams
Wrong user content
Misspellings & typos
Hard to get training data
Harder languages or transliteration issues
Memory / computing limitations
Online learning & Stacking
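Of the items above, N-grams are the quickest to illustrate: instead of single words, sequences of N consecutive tokens become features, which keeps some word order. A minimal sketch with an illustrative title:

```python
def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "skip the line eiffel tower".split()
print(ngrams(tokens, 2))
# ['skip the', 'the line', 'line eiffel', 'eiffel tower']
```

With bigrams, "eiffel tower" becomes a single feature instead of two unrelated words.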
Some resources
https://www.slideshare.net/mylittleadventure/introduction-machine-learning-by-mylittleadventure
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
https://bit.ly/2uL954v
NLTK
Book
Stanford’s GloVe
Dataset
Course
Andrew Ng (Coursera)
Platform
Libraries
Thank you
July 2nd 2018
Questions?
@mylitadventure
@brainstorm_me
johnny.rahajarison@mylittleadventure.com