SophiaConf 2018 - J. Rahajarison (My Little Adventure)

Presentation slides: Natural Language Processing and Machine Learning


  1. Smart recommendation engine of things to do in destination: Natural Language Processing and Machine Learning. How to automatically categorize tours and activities? July 2nd 2018
  2. Introduction: MyLittleAdventure (@mylitadventure), Johnny RAHAJARISON (@brainstorm_me), johnny.rahajarison@mylittleadventure.com
  3. Agenda: Introduction to machine learning; Why is Natural Language Processing so hard?; How do we process text?; Let's try it out; Go further
  4. What's Machine Learning? Software that does something without being explicitly programmed to, just by learning through examples. The same software can be used for various tasks. It learns from experience with respect to some task and a performance measure, and improves through experience.
  5. Unsupervised algorithms: clustering, anomaly detection
  6. Supervised algorithms: classification, regression
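The two families map onto different scikit-learn estimators (the library referenced later in the deck); a minimal sketch on invented toy points, contrasting clustering (no labels given) with classification (labels provided):

```python
# Sketch: unsupervised (clustering) vs supervised (classification) with scikit-learn.
# The toy points and labels below are invented for illustration only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [7.9, 8.1]])

# Unsupervised: no labels are given, KMeans discovers the two groups on its own.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)                     # e.g. [0 0 1 1]

# Supervised: we provide labels and learn to predict them for new points.
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict([[8.5, 8.0]]))    # -> [1]
```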
  7. You said text, right?
  8. Obviously, you said text, not numbers. Challenges: context, polysemy, synonyms, enantiosemy, neologisms, sarcasm, names, rare words, common sense, dialects, non-formal language / abbreviations
  9. Ambiguity? "I saw a man on a hill with a telescope."
  10. Ambiguity? "I saw a man on a hill with a telescope." (the same sentence read again with a different attachment: is the telescope mine, the man's, or on the hill?)
  11. Text should be prepared
  12. Let's clean our text first: ✓ Tokenize sentences ✓ Tokenize words ✓ Transliterate ✓ Normalize ✓ Filter out (punctuation, special characters, stop words) ✓ Use a stemmer and/or a lemmatizer ("be" = am, are, is; "vari" = variation, vary, varies, variables). Example output: ['one', 'morn', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubl', 'dream', 'he', 'found', 'himself', 'transform', 'in', 'hi', 'bed', 'into', 'a', 'horribl', 'vermin', 'He', 'lay', 'on', 'hi', 'armour-lik', 'back', 'and', 'if', 'he', 'lift', 'hi', 'head', 'a', 'littl', 'he', 'could', 'see', 'hi', 'brown', 'belli', 'slightli', 'dome', 'and', 'divid', 'by', 'arch', 'into', 'stiff', 'section', 'the', 'bed', 'wa', 'hardli', 'abl', 'to', 'cover', 'it', 'and', 'seem', 'readi', 'to', 'slide', 'off', 'ani', 'moment', 'hi', 'mani', 'leg', 'piti', 'thin', 'compar', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'wave', 'about', 'helplessli', 'as', 'he', 'look', 'what', "'s", 'happen', 'to']
  13. A bag of words: "John", "likes", "to", "watch", "movies", "Mary", "likes", "movies", "too" → {"John":1, "likes":2, "to":1, "watch":1, "movies":2, "Mary":1, "too":1} → {131:1, 132:2, 133:1, 134:1, 135:2, 136:1, 137:1} → [1, 2, 1, 1, 2, 1, 1]. Each unique word in our dictionary corresponds to a feature.
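A sketch of the same bag-of-words encoding with scikit-learn's CountVectorizer (the feature indices it assigns will differ from the 131–137 example on the slide):

```python
# Sketch: bag-of-words features with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["John likes to watch movies. Mary likes movies too."]

vectorizer = CountVectorizer(lowercase=False)
counts = vectorizer.fit_transform(docs)

print(vectorizer.vocabulary_)   # word -> feature index, e.g. {'John': 0, 'likes': 2, ...}
print(counts.toarray()[0])      # per-feature counts, e.g. [1 1 2 2 1 1 1]
```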
  14. TF-IDF. TF (Term Frequency) = occurrences of a term / total words in the document. IDF (Inverse Document Frequency) = log(count of documents / count of documents where the term appears).
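A small worked version of this formula in plain Python (the documents and the chosen term are invented for illustration); scikit-learn's TfidfVectorizer implements a smoothed variant of the same idea:

```python
# Sketch: TF-IDF computed by hand for one term, following the slide's formula.
import math

docs = [
    "dinner cruise with champagne",
    "eiffel tower with dinner",
    "segway tour of the city",
]
term, doc = "dinner", docs[0].split()

tf = doc.count(term) / len(doc)                 # occurrences / total words in the document
df = sum(1 for d in docs if term in d.split())  # documents containing the term
idf = math.log(len(docs) / df)                  # log(total docs / docs with the term)

print(tf * idf)  # TF-IDF weight of "dinner" in the first document
```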
  15. Another way: use word embeddings. Word embeddings capture relative meaning; use vectors to get a comprehensive geometry of words.
  16. Another way: use word embeddings. Paris - France + China = Beijing
  17. Another way: use word embeddings. Example of the "movies" vector: -0.34582 0.057328 0.1328 0.22376 0.10161 0.52948 -0.30199 0.45676 … (one real-valued component per embedding dimension; the full vector fills the slide).
  18. Another way: use word embeddings. Instead of the bag-of-words counts {"John":1, "likes":2, "to":1, "watch":1, "movies":2, "Mary":1, "too":1} → {131:1, 132:2, 133:1, 134:1, 135:2, 136:1, 137:1} → [1, 2, 1, 1, 2, 1, 1], each word slot holds its embedding vector: [[], 2*[], [], [], 2*[-0.34582, 0.057328, … 0.22376, 0.10161], [], []], the "movies" slots being filled by the embedding vector for "movies".
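A sketch of the vector arithmetic from slides 16–18 using pretrained GloVe vectors via gensim; the glove-wiki-gigaword-100 download name is an assumption, and any pretrained embedding model exposing most_similar works the same way:

```python
# Sketch: word-vector lookup and arithmetic with pretrained GloVe embeddings via gensim.
import gensim.downloader as api

# Downloads pretrained 100-dimensional GloVe vectors on first use.
vectors = api.load("glove-wiki-gigaword-100")

print(vectors["movies"][:5])   # first components of the "movies" embedding

# paris - france + china ≈ ?
print(vectors.most_similar(positive=["paris", "china"], negative=["france"], topn=1))
# -> nearest word to the resulting vector, typically ('beijing', ...)
```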
  19. Let's predict
  20. Recipe: prepare training / test data (files, database, cache, data flow); select a model and its (hyper)parameters; train the algorithm; use or store your trained estimator; make predictions; measure accuracy / precision.
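The recipe maps almost one-to-one onto scikit-learn; a minimal end-to-end sketch, with invented activity names and a made-up "is it food-related?" label standing in for the real MyLittleAdventure data:

```python
# Sketch of the full recipe: prepare data, train, store, predict, measure (scikit-learn).
# The activity names and labels below are invented for illustration.
from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = [
    "Eiffel Tower with dinner", "Gourmet tour of Paris",
    "Dinner cruise with Champagne", "Cooking class in Paris",
    "Skip the line Eiffel Tower", "Louvre Museum fast track",
    "Segway tour of city highlights", "Aquarium of Paris ticket",
]
is_food = [1, 1, 1, 1, 0, 0, 0, 0]

# 1. Prepare training / test data
X_train, X_test, y_train, y_test = train_test_split(
    texts, is_food, test_size=0.25, random_state=0, stratify=is_food)

# 2. Select a model and its (hyper)parameters, 3. train it
model = make_pipeline(TfidfVectorizer(), LogisticRegression(C=1.0))
model.fit(X_train, y_train)

# 4. Store / reload the trained estimator
dump(model, "food_classifier.joblib")
model = load("food_classifier.joblib")

# 5. Make predictions, 6. measure
print(model.predict(["3 course meal in Eiffel Tower"]))   # hopefully -> [1]
print(accuracy_score(y_test, model.predict(X_test)))
```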
  21. Collect our training & test dataset (slide columns: Food / Label / Vectorized); each activity label with its vectorized representation:
      Eiffel Tower with Dinner → [0., 0., 0., 0., 0.5, 0.5, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.5, 0., 0.5]
      Skip the line Eiffel Tower → [0., 0., 0., 0., 0., 0.3967171, 0., 0., 0., 0.47792296, 0., 0., 0., 0., 0., 0.47792296, 0.47792296, 0., 0., 0.3967171, 0., 0.]
      Louvre Museum fast track → [0., 0., 0., 0., 0., 0., 0.5, 0., 0., 0., 0.5, 0.5, 0., 0., 0., 0., 0., 0., 0., 0., 0.5, 0.]
      Gourmet tour of Paris → [0., 0., 0., 0., 0., 0., 0., 0.58910044, 0., 0., 0., 0., 0.41798437, 0.48900396, 0., 0., 0., 0., 0.48900396, 0., 0., 0.]
      Segway tour of city's highlights → [0., 0., 0.48838773, 0., 0., 0., 0., 0., 0.48838773, 0., 0., 0., 0.3465257, 0., 0.48838773, 0., 0., 0., 0.40540376, 0., 0., 0.]
      Dinner cruise with Champagne → [0., 0.54408243, 0., 0.54408243, 0.45163515, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.45163515]
      Aquarium of Paris ticket → [0.55967542, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.39710644, 0.46457866, 0., 0., 0., 0.55967542, 0., 0., 0., 0.]
      …
  22. Choose a classifier algorithm
  23. A few recommendations: Naive Bayes / Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, SVM, Neural Networks
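Most of these (neural networks aside, which usually live in dedicated libraries) are available in scikit-learn; a short sketch trying several of them on the same TF-IDF features, reusing the toy texts / is_food lists from the recipe sketch above:

```python
# Sketch: trying several of the recommended classifiers on the same features.
# Reuses the toy `texts` / `is_food` lists from the recipe sketch above.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

X = TfidfVectorizer().fit_transform(texts)

for clf in (MultinomialNB(), LogisticRegression(), RandomForestClassifier(),
            GradientBoostingClassifier(), LinearSVC()):
    clf.fit(X, is_food)
    print(type(clf).__name__, clf.score(X, is_food))   # training accuracy only
```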
  24. Let's measure. Predicted probability of the "Food" label per activity: Eiffel Tower with Dinner 0.83; Gourmet tour of Paris 0.96; Dinner cruise with Champagne 1.0; Segway tour of city's highlights 0.03; Orsay dedicated entrance 0.02; 3 course meal in Eiffel Tower 0.97; Cooking class in Paris 0.89; Moulin Rouge Paris dinner show 0.91 (compared across the training set and real data).
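The scores in this table read as predicted probabilities for the "Food" label; a sketch of how such numbers and a precision figure could be produced with the pipeline from the recipe sketch (activity names taken from the slide, everything else invented):

```python
# Sketch: per-activity probability of the "Food" label, plus a precision measure.
# `model`, `X_test`, `y_test` come from the recipe sketch above.
from sklearn.metrics import precision_score

new_activities = [
    "Moulin Rouge Paris dinner show",
    "Orsay dedicated entrance",
]
for name, proba in zip(new_activities, model.predict_proba(new_activities)[:, 1]):
    print(f"{name}: {proba:.2f}")        # probability of being food-related

print("precision:", precision_score(y_test, model.predict(X_test)))
```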
  25. (image-only slide, no text)
  26. Go further
  27. There is way more: cross-validation dataset, n-grams, wrong user content, misspellings & typos, hard-to-get training data, harder languages or transliteration issues, memory / computing limitations, online learning & stacking
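Two of these follow-ups, a cross-validation set and n-grams, are one-liners in scikit-learn; a sketch on the same toy texts / is_food data:

```python
# Sketch: cross-validation and n-gram features (two of the "go further" items).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# ngram_range=(1, 2) adds word bigrams such as "eiffel tower" to the unigram features.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())

# 4-fold cross-validation: each fold is held out in turn as a validation set.
scores = cross_val_score(model, texts, is_food, cv=4)
print(scores, scores.mean())
```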
  28. Some resources. Course: Andrew Ng (Coursera). Libraries: NLTK (NLTK Book), scikit-learn (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html). Dataset: Stanford's GloVe. Platform: https://bit.ly/2uL954v. Also: https://www.slideshare.net/mylittleadventure/introduction-machine-learning-by-mylittleadventure
  29. Thank you. Questions? @mylitadventure, @brainstorm_me, johnny.rahajarison@mylittleadventure.com (July 2nd 2018)
