
Visual-Semantic Embeddings: some thoughts on Language

Language technology is rapidly evolving. A resurgence in the use of distributed semantic representations and word embeddings, combined with the rise of deep neural networks, has led to new approaches and new state-of-the-art results in many natural language processing tasks. One exciting, and very recent, trend can be seen in multimodal approaches that fuse techniques and models from natural language processing (NLP) with those from computer vision.

The talk gives an overview of the NLP side of this trend. It starts with a short overview of the challenges in building deep networks for language, what makes for a “good” language model, and the specific requirements that multimodal embeddings place on semantic word spaces.



  1. 1. 1 @graphific Roelof Pieters. Visual-Semantic Embeddings: Some Thoughts on Language. 23 January 2015, KTH. Feeda. www.csc.kth.se/~roelof/ roelof@kth.se
  2. 2. Language Understanding 2
  3. 3. Can we understand Language ? 1. Language is ambiguous:
 Every sentence has many possible interpretations. 2. Language is productive:
 We will always encounter new words or new constructions 3. Language is culturally specific Some of the challenges in Language Understanding: 3
  4. 4. Can we understand Language ? 1. Language is ambiguous:
 Every sentence has many possible interpretations. 2. Language is productive:
 We will always encounter new words or new constructions • plays well with others VB ADV P NN NN NN P DT • fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN • the students went to class DT NN VB P NN 4 Some of the challenges in Language Understanding:
  5. 5. Can we understand Language ? 1. Language is ambiguous:
 Every sentence has many possible interpretations. 2. Language is productive:
 We will always encounter new words or new constructions 5 Some of the challenges in Language Understanding:
  6. 6. [Karlgren 2014, NLP Sthlm Meetup] 6
  7. 7. Can we understand Language ? 1. Language is ambiguous:
 Every sentence has many possible interpretations. 2. Language is productive:
 We will always encounter new words or new constructions 3. Language is culturally specific Some of the challenges in Language Understanding: 7
  8. 8. ML: Traditional Approach 1. Gather as much LABELED data as you can get 2. Throw some algorithms at it (mainly put in an SVM and keep it at that) 3. If you actually have tried more algos: Pick the best 4. Spend hours hand engineering some features / feature selection / dimensionality reduction (PCA, SVD, etc) 5. Repeat… For each new problem/question: 8
  9. 9. Machine Learning for NLP Data Classic Approach: Data is fed into a learning algorithm: Learning 
 Algorithm 9
  10. 10. Machine Learning for NLP some of the (many) treebank datasets source: http://www-nlp.stanford.edu/links/statnlp.html#Treebanks ! 10
  11. 11. Penn Treebank That’s a lot of “manual” work: 11
  12. 12. • the students went to class DT NN VB P NN • plays well with others VB ADV P NN NN NN P DT • fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN With a lot of issues: Penn Treebank 12
  13. 13. Machine Learning for NLP Learning 
 Algorithm Data “Features” Prediction Prediction/
 Classifier train set test set 13
  14. 14. Machine Learning for NLP Learning 
 Algorithm “Features” Prediction Prediction/
 Classifier train set test set 14
  15. 15. Deep Learning: Why for NLP ? Beat state of the art at: • Language Modeling (Mikolov et al. 2011) [WSJ AR task] • Speech Recognition (Dahl et al. 2012, Seide et al. 2011; following Mohamed et al. 2011) • Sentiment Classification (Socher et al. 2011) and we have already known for some time that it works well for Computer Vision, of course: • MNIST hand-written digit recognition (Ciresan et al. 2010) • Image Recognition (Krizhevsky et al. 2012) [ImageNet] 15
  16. 16. One Model rules them all ?
 
 DL approaches have been successfully applied to: Deep Learning: Why for NLP ? Automatic summarization Coreference resolution Discourse analysis Machine translation Morphological segmentation Named entity recognition (NER) Natural language generation Natural language understanding Optical character recognition (OCR) Part-of-speech tagging Parsing Question answering Relationship extraction sentence boundary disambiguation Sentiment analysis Speech recognition Speech segmentation Topic segmentation and recognition Word segmentation Word sense disambiguation Information retrieval (IR) Information extraction (IE) Speech processing 16
  17. 17. • What is the meaning of a word?
 (Lexical semantics) • What is the meaning of a sentence?
 ([Compositional] semantics) • What is the meaning of a longer piece of text? (Discourse semantics) Semantics: Meaning 17
  18. 18. • NLP treats words mainly (rule-based/statistical approaches at least) as atomic symbols:
 • or in vector space:
 • also known as “one hot” representation. • Its problem ? Word Representation Love Candy Store [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …] Candy [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …] AND Store [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 …] = 0 ! 18
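To make the problem with one-hot vectors concrete: any two distinct words are orthogonal, so their dot-product similarity is always zero, however related they are. A minimal sketch (the toy vocabulary and indices below are purely illustrative):

```python
import numpy as np

# Toy vocabulary; the word indices are arbitrary and purely illustrative.
vocab = {"love": 0, "candy": 5, "store": 15}
V = 20  # vocabulary size

def one_hot(word):
    """Return the "one hot" vector for a word: all zeros except a single 1."""
    v = np.zeros(V)
    v[vocab[word]] = 1.0
    return v

candy, store = one_hot("candy"), one_hot("store")
# Any two distinct one-hot vectors are orthogonal, so their similarity is 0:
print(np.dot(candy, store))  # 0.0 -- "candy" AND "store" = 0
```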
  19. 19. • Structure corresponds to meaning: Structure and Meaning 19
  20. 20. • Semantics • Syntax 20 NLP: what can we work with?
  21. 21. • Language models define probability distributions over (natural language) strings or sentences • Joint and Conditional Probability Language Model 21
  22. 22. • Language models define probability distributions over (natural language) strings or sentences Language Model 22
  23. 23. • Language models define probability distributions over (natural language) strings or sentences Language Model 23
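As a concrete illustration of “a probability distribution over sentences”, here is a minimal bigram language model combining the chain rule with a first-order Markov assumption; the toy corpus and the lack of smoothing are simplifications for illustration only:

```python
from collections import Counter

# Tiny illustrative corpus; a real LM is estimated from far more data.
corpus = "the students went to class . the students like class .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_cond(word, prev):
    """Conditional probability P(word | prev) from bigram counts (no smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(words):
    """Joint probability via the chain rule with a first-order Markov assumption."""
    p = unigrams[words[0]] / len(corpus)
    for prev, word in zip(words, words[1:]):
        p *= p_cond(word, prev)
    return p

print(p_sentence("the students went to class".split()))
```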
  24. 24. Word senses What is the meaning of words? • Most words have many different senses:
 dog = animal or sausage? How are the meanings of different words related? • - Specific relations between senses:
 Animal is more general than dog. • - Semantic fields:
 money is related to bank 24
  25. 25. Word senses Polysemy: • A lexeme is polysemous if it has different related senses • bank = financial institution or building Homonyms: • Two lexemes are homonyms if their senses are unrelated, but they happen to have the same spelling and pronunciation • bank = (financial) bank or (river) bank 25
  26. 26. Word senses: relations Symmetric relations: • Synonyms: couch/sofa
 Two lemmas with the same sense • Antonyms: cold/hot, rise/fall, in/out
 Two lemmas with the opposite sense Hierarchical relations: • Hypernyms and hyponyms: pet/dog
 The hyponym (dog) is more specific than the hypernym (pet) • Holonyms and meronyms: car/wheel
 The meronym (wheel) is a part of the holonym (car) 26
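These sense relations can also be explored programmatically in WordNet. The sketch below assumes NLTK is installed and the WordNet corpus has been fetched with nltk.download("wordnet"):

```python
# Assumes nltk is installed and the corpus was fetched with nltk.download("wordnet").
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]                    # first sense: the animal
print(dog.hypernyms())                        # more general senses (hypernyms)
print(wn.synsets("car")[0].part_meronyms())   # parts of a car (meronyms)

# Polysemy / homonymy: "bank" has several senses, related and unrelated.
for sense in wn.synsets("bank")[:3]:
    print(sense.name(), "-", sense.definition())
```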
  27. 27. Distributional representations “You shall know a word by the company it keeps”
 (J. R. Firth 1957) One of the most successful ideas of modern statistical NLP! these words represent banking • Hard (class based) clustering models • Soft clustering models 27
  28. 28. Distributional hypothesis He filled the wampimuk, passed it around and we all drank some. We found a little, hairy wampimuk sleeping behind the tree. (McDonald & Ramscar 2001) 28
  29. 29. Distributional semantics Landauer and Dumais (1997), Turney and Pantel (2010), … 29
  30. 30. Distributional semantics Distributional meaning as co-occurrence vector: 30
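A co-occurrence vector of this kind can be built with nothing more than a sliding context window over a corpus; the snippet below is a minimal sketch using the wampimuk sentence from the previous slide, with an arbitrary window size of 2:

```python
from collections import defaultdict, Counter

# Word-by-word co-occurrence counts with a symmetric window of size 2.
corpus = "he filled the wampimuk passed it around and we all drank some".split()
window = 2

cooc = defaultdict(Counter)
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[word][corpus[j]] += 1

# The distributional "meaning" of a word is its vector of context counts:
print(dict(cooc["wampimuk"]))
```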
  31. 31. Distributional representations • Taking it further: • Continuous word embeddings • Combine vector space semantics with the prediction of probabilistic models • Words are represented as a dense vector: Candy = 31
  32. 32. Word Embeddings: Socher / Vector Space Model, adapted from Bengio, “Representation Learning and Deep Learning”, July 2012, UCLA. In a perfect world: 32
  33. 33. Word Embeddings: Socher / Vector Space Model, adapted from Bengio, “Representation Learning and Deep Learning”, July 2012, UCLA. In a perfect world: the country of my birth the place where I was born 33
  34. 34. • Can theoretically (given enough units) approximate “any” function • and fit to “any” kind of data • Efficient for NLP: hidden layers can be used as word lookup tables • Dense distributed word vectors + efficient NN training algorithms: • Can scale to billions of words ! Why Neural Networks for NLP? 34
  35. 35. • Representation of words as continuous vectors has a long history (Hinton et al. 1986; Rumelhart et al. 1986; Elman 1990) • First neural network language model: NNLM (Bengio et al. 2001; Bengio et al. 2003) based on earlier ideas of distributed representations for symbols (Hinton 1986) How? 35
  36. 36. Word Embeddings: Socher / Vector Space Model. Figure (edited) from Bengio, “Representation Learning and Deep Learning”, July 2012, UCLA. In a perfect world: the country of my birth the place where I was born ? … 36
  37. 37. Compositionality Principle of compositionality: the “meaning (vector) of a complex expression (sentence) is determined by: - the meanings of its constituent expressions (words) and - the rules (grammar) used to combine them” Gottlob Frege (1848 - 1925) 37
  38. 38. • How do we handle the compositionality of language in our models? 38 Compositionality
  39. 39. • How do we handle the compositionality of language in our models? • Recursion :
 the same operator (same parameters) is applied repeatedly on different components 39 Compositionality
  40. 40. • How do we handle the compositionality of language in our models? • Option 1: Recurrent Neural Networks (RNN) 40 RNN 1: Recurrent Neural Networks
  41. 41. • How do we handle the compositionality of language in our models? • Option 2: Recursive Neural Networks (also sometimes called RNN) 41 RNN 2: Recursive Neural Networks
  42. 42. • achieved SOTA in 2011 on Language Modeling (WSJ AR task) (Mikolov et al., INTERSPEECH 2011): • and again at ASRU 2011: 42 Recurrent Neural Networks “Comparison to other LMs shows that RNN LMs are state of the art by a large margin. Improvements increase with more training data.” “[RNN LM trained on a] single core on 400M words in a few days, with 1% absolute improvement in WER on state of the art setup” Mikolov, T., Karafiat, M., Burget, L., Cernocky, J.H., Khudanpur, S. (2011)
 Recurrent neural network based language model
  43. 43. 43 Recurrent Neural Networks (simple recurrent 
 neural network for LM) input hidden layer(s) output layer + sigmoid activation function + softmax function: Mikolov, T., Karafiat, M., Burget, L., Cernocky, J.H., Khudanpur, S. (2011)
 Recurrent neural network based language model
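The architecture on this slide (1-of-V input, sigmoid hidden layer with a recurrent connection, softmax output) can be sketched as a forward pass in a few lines of NumPy. Dimensions and random weights are placeholders; a real RNNLM would be trained, e.g. with the backpropagation through time shown on the next slides:

```python
import numpy as np

V, H = 10, 8  # toy vocabulary size and hidden-layer size (placeholders)
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(H, V))  # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(H, H))  # hidden -> hidden (the recurrence)
W_hy = rng.normal(scale=0.1, size=(V, H))  # hidden -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(word_id, h_prev):
    """One time step of a simple (Elman-style) recurrent language model."""
    x = np.zeros(V)
    x[word_id] = 1.0                       # 1-of-V encoding of the current word
    h = sigmoid(W_xh @ x + W_hh @ h_prev)  # new hidden state
    y = softmax(W_hy @ h)                  # distribution over the next word
    return h, y

h = np.zeros(H)
for w in [3, 7, 1]:                        # a toy sequence of word ids
    h, y = step(w, h)
print(y.sum())                             # ~1.0: a valid probability distribution
```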
  44. 44. 44 Recurrent Neural Networks backpropagation through time
  45. 45. 45 Recurrent Neural Networks backpropagation through time class based recurrent NN [code (Mikolov’s RNNLM Toolkit) and more info: http://rnnlm.org/ ]
  46. 46. • Recursive Neural Network for LM (Socher et al. 2011; Socher 2014) • achieved SOTA on the new Stanford Sentiment Treebank dataset (while comparing against many other models): Recursive Neural Network 46 Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013)
 Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank info & code: http://nlp.stanford.edu/sentiment/
  47. 47. Recursive Neural Tensor Network 47 code & info: http://www.socher.org/index.php/Main/ParsingNaturalScenesAndNaturalLanguageWithRecursiveNeuralNetworks Socher, R., Liu, C.C., Ng, A.Y., Manning, C.D. (2011)
 Parsing Natural Scenes and Natural Language with Recursive Neural Networks
  48. 48. Recursive Neural Tensor Network 48
  49. 49. • RNN (Socher et al. 2011a) Recursive Neural Network 49 Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013)
 Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank info & code: http://nlp.stanford.edu/sentiment/
  50. 50. • RNN (Socher et al. 2011a) • Matrix-Vector RNN (MV-RNN) (Socher et al., 2012) Recursive Neural Network 50 Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013)
 Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank info & code: http://nlp.stanford.edu/sentiment/
  51. 51. • RNN (Socher et al. 2011a) • Matrix-Vector RNN (MV-RNN) (Socher et al., 2012) • Recursive Neural Tensor Network (RNTN) (Socher et al. 2013) Recursive Neural Network 51
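The shared ingredient of these three models is a composition function applied recursively over a parse tree. The sketch below shows only the basic RNN variant, parent = tanh(W [c1; c2] + b); the MV-RNN and RNTN extend this with per-word matrices and a tensor term, omitted here. All vectors and weights are random placeholders:

```python
import numpy as np

d = 4                                          # toy embedding dimension
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(d, 2 * d))     # shared composition matrix
b = np.zeros(d)

def compose(left, right):
    """Basic recursive-NN composition: parent = tanh(W [left; right] + b).
    The same parameters are re-applied at every node of the parse tree."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Compose word vectors bottom-up along a tiny parse: ((very good) movie)
very, good, movie = (rng.normal(size=d) for _ in range(3))
phrase = compose(very, good)
sentence = compose(phrase, movie)
print(sentence.shape)                          # (4,): same dimensionality at every node
```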
  52. 52. • negation detection: Recursive Neural Network 52 Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013)
 Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank info & code: http://nlp.stanford.edu/sentiment/
  53. 53. NP PP/IN NP DT NN PRP$ NN Parse Tree Recurrent NN for Vector Space 53
  54. 54. NP PP/IN NP DT NN PRP$ NN Parse Tree IN DT NN PRP NN Compositionality 54 Recurrent NN: Compositionality / Recurrent NN for Vector Space
  55. 55. NP IN NP PRP NN Parse Tree DT NN Compositionality 55 Recurrent NN: Compositionality / Recurrent NN for Vector Space
  56. 56. NP IN NP DT NN PRP NN PP NP (S / ROOT) “rules” “meanings” Compositionality 56 Recurrent NN: Compositionality / Recurrent NN for Vector Space
  57. 57. Vector Space + Word Embeddings: Socher 57 Recurrent NN: Compositionality / Recurrent NN for Vector Space
  58. 58. Vector Space + Word Embeddings: Socher 58 Recurrent NN for Vector Space
  59. 59. Word Embeddings: Turian (2010) Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning code & info: http://metaoptimize.com/projects/wordreprs/ 59
  60. 60. Word Embeddings: Turian (2010) Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning code & info: http://metaoptimize.com/projects/wordreprs/ 60
  61. 61. Word Embeddings: Collobert & Weston (2011) Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011) . Natural Language Processing (almost) from Scratch 61
  62. 62. Multi-embeddings: Stanford (2012) Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng (2012)
 Improving Word Representations via Global Context and Multiple Word Prototypes 62
  63. 63. Linguistic Regularities: Mikolov (2013) code & info: https://code.google.com/p/word2vec/ Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations 63
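The famous analogy queries (king - man + woman ≈ queen) can be reproduced with the gensim port listed later in this deck, given any pretrained vectors in word2vec format; the file name below is a placeholder:

```python
from gensim.models import KeyedVectors

# "vectors.bin" is a placeholder for any pretrained word2vec-format file,
# e.g. the GoogleNews vectors linked from the word2vec project page.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# The linguistic-regularities test: king - man + woman should be near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```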
  64. 64. Word Embeddings for MT: Mikolov (2013) Mikolov, T., Le, Q.V., Sutskever, I. (2013).
 Exploiting Similarities among Languages for Machine Translation 64
  65. 65. Recursive Deep Models & Sentiment: Socher (2013) Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., Potts, C. (2013)
 Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. code & demo: http://nlp.stanford.edu/sentiment/index.html 65
  66. 66. Paragraph Vectors: Le & Mikolov (2014) Le, Q., Mikolov, T. (2014) Distributed Representations of Sentences and Documents 66 • add context (sentence, paragraph, document) to word vectors during training ! Results on Stanford Sentiment
 Treebank dataset:
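Gensim ships an implementation of paragraph vectors as Doc2Vec. A minimal sketch, with an invented two-document corpus and illustrative hyperparameters (gensim >= 4 argument names):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Invented toy corpus; each document gets a tag and thus its own paragraph vector.
docs = [
    TaggedDocument("a charming and often affecting journey".split(), tags=["d0"]),
    TaggedDocument("the movie is a dull mess".split(), tags=["d1"]),
]

# Hyperparameters are illustrative placeholders.
model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, epochs=20)
print(model.dv["d0"][:5])                                # learned paragraph vector
print(model.infer_vector("a dull journey".split())[:5])  # vector for unseen text
```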
  67. 67. Global Vectors, GloVe: Stanford (2014) Pennington, J., Socher, R., Manning, C.D. (2014).
 GloVe: Global Vectors for Word Representation code & demo: http://nlp.stanford.edu/projects/glove/ vs results on the word analogy task “similar accuracy” 67
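The pretrained GloVe vectors on the project page are distributed as plain text, one word and its vector per line, so they are easy to load without any special library. A small sketch; glove.6B.50d.txt is one of the files offered for download, and the comment states an expectation rather than a verified number:

```python
import numpy as np

def load_glove(path):
    """Load GloVe's plain-text format: one word and its vector per line."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vecs

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# glove.6B.50d.txt is one of the files distributed on the GloVe project page.
vecs = load_glove("glove.6B.50d.txt")
print(cosine(vecs["ice"], vecs["steam"]))
print(cosine(vecs["ice"], vecs["fashion"]))  # expected to be noticeably lower
```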
  68. 68. Dependency-based Embeddings: Levy & Goldberg (2014) Levy, O., Goldberg, Y. (2014). Dependency-Based Word Embeddings code & demo: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/ - Syntactic Dependency Context: Australian scientist discovers star with telescope - Bag of Words (BoW) Context [precision/recall plot] “Dependency-based embeddings have more functional similarities” 68
  69. 69. Joint Image-Word Embeddings 69
  70. 70. 1. Multimodal representation learning 2. Generating descriptions of images 3. Ranking images and captions (“image-sentence ranking”) 4. Evaluation Methods? Some Current Approaches 70
  71. 71. • Learning joint image-word embeddings in a low-dimensional embedding space (Weston et al. 2010; Frome et al. 2013) • Embedding images and sentences into a common space (Socher et al. TACL 2014; Karpathy et al. NIPS 2014) • also: deep Boltzmann machines (Srivastava & Salakhutdinov NIPS 2012), log-bilinear neural language models (Kiros et al. ICML 2014), autoencoders (Ngiam ICML 2011), recurrent neural networks (m-RNN: Mao et al. 2014) and topic models (Jia ICCV 2011) 1. Multimodal representation learning 71
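A common ingredient across these joint-embedding approaches is a pairwise ranking (hinge) loss that scores matching image-sentence pairs above mismatched ones in the shared space. The sketch below illustrates only that loss, with random features standing in for the CNN image encoder and the sentence encoder; it is not any specific model from the papers cited above:

```python
import numpy as np

rng = np.random.default_rng(2)
d_img, d_txt, d_joint, n = 16, 12, 8, 5   # toy dimensions and batch size

# Linear maps into the shared space (stand-ins for learned image/sentence encoders).
W_img = rng.normal(scale=0.1, size=(d_joint, d_img))
W_txt = rng.normal(scale=0.1, size=(d_joint, d_txt))

def embed(x, W):
    z = W @ x
    return z / np.linalg.norm(z)           # unit length, so dot product = cosine

images = [embed(rng.normal(size=d_img), W_img) for _ in range(n)]
sents = [embed(rng.normal(size=d_txt), W_txt) for _ in range(n)]

def ranking_loss(images, sents, margin=0.1):
    """Pairwise hinge loss: the matching pair (i, i) should score higher than
    every mismatched pair (i, j) by at least `margin`."""
    loss = 0.0
    for i in range(len(images)):
        pos = images[i] @ sents[i]
        for j in range(len(sents)):
            if j != i:
                loss += max(0.0, margin - pos + images[i] @ sents[j])
    return loss

print(ranking_loss(images, sents))
```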
  72. 72. • Neural network methods • Composition-based methods. • Template-based methods. 2. Generating descriptions of images 72
  73. 73. • Bi-directional approaches (Hockenmaier group): kernel CCA (Hodosh et al. JAIR 2013), normalized CCA (Gong et al. EECV 2014) • Dependency tree recursive networks (Stanford) (Socher et al. TACL 2014) 3. Ranking images and captions 73
  74. 74. • Current evaluation methods are unreliable and don't match human judgements • BLEU • ROUGE • Perplexity?? • Ranking as proxy for generation / scoring function 4. Evaluation? 74
  75. 75. • unsupervised pre-training on many images • in parallel train a neural network Language Model • train linear mapping between image representations and word embeddings, representing the different “classes” 75 Zero-shot Learning
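The “train a linear mapping” step can be sketched as a least-squares regression from image features to word embeddings, after which an unseen image is classified by mapping it into word-vector space and taking the nearest label embedding, including labels never seen during training. Everything below is random placeholder data, purely to show the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(3)
d_img, d_word, n_train = 32, 10, 200   # toy dimensions (placeholders)

# Stand-ins for CNN image features X and the word embeddings of their labels Y.
X = rng.normal(size=(n_train, d_img))
Y = rng.normal(size=(n_train, d_word))

# Learn a linear map from image-feature space into word-embedding space.
M, *_ = np.linalg.lstsq(X, Y, rcond=None)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Zero-shot prediction: map a new image into word space and take the nearest
# label embedding; candidate labels may include classes never seen in training.
label_vecs = {"dog": rng.normal(size=d_word), "zebra": rng.normal(size=d_word)}
image = rng.normal(size=d_img)
projected = image @ M
print(max(label_vecs, key=lambda w: cosine(projected, label_vecs[w])))
```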
  76. 76. DeViSE model (Frome et al. 2013) • skip-gram text model on wikipedia corpus of 5.7 million documents (5.4 billion words) - approach from (Mikolov et al. ICLR 2013) 76 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., Ranzato, M.A. (2013) 
 Devise: A deep visual-semantic embedding model
  77. 77. Encoder: A deep convolutional network (CNN) and long short-term memory recurrent network (LSTM) for learning a joint image-sentence embedding. Decoder: A new neural language model that combines structure and content vectors for generating words one at a time in sequence. Encoder-Decoder pipeline (Kiros et al. 2014) 77 Kiros, R., Salakhutdinov, R., Zemel, R. S. (2014)
 Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
  78. 78. Kiros, R., Salakhutdinov, R., Zemel, R. S. (2014)
 Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models • matches state-of-the-art performance on Flickr8K and Flickr30K without using object detections • new best results when using the 19-layer Oxford convolutional network. • linear encoders: learned embedding space captures multimodal regularities (e.g. *image of a blue car* - "blue" + "red" is near images of red cars) Encoder-Decoder pipeline (Kiros et al 2014) 78
  79. 79. demo time
 (code adapted from: https://github.com/karpathy/neuraltalk) 79
  80. 80. • Theano - CPU/GPU symbolic expression compiler in Python (from the LISA lab at University of Montreal). http://deeplearning.net/software/theano/ • Pylearn2 - library designed to make machine learning research easy. http://deeplearning.net/software/pylearn2/ • Torch - Matlab-like environment for state-of-the-art machine learning algorithms in Lua (from Ronan Collobert, Clement Farabet and Koray Kavukcuoglu) http://torch.ch/ • more info: http://deeplearning.net/software_links/ Wanna Play ? General Deep Learning 80
  81. 81. • RNNLM (Mikolov)
 http://rnnlm.org • NB-SVM
 https://github.com/mesnilgr/nbsvm • Word2Vec (skipgrams/cbow)
 https://code.google.com/p/word2vec/ (original)
 http://radimrehurek.com/gensim/models/word2vec.html (python) • GloVe
 http://nlp.stanford.edu/projects/glove/ (original)
 https://github.com/maciejkula/glove-python (python) • Socher et al / Stanford RNN Sentiment code:
 http://nlp.stanford.edu/sentiment/code.html • Deep Learning without Magic Tutorial:
 http://nlp.stanford.edu/courses/NAACL2013/ Wanna Play ? NLP 81
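As a starting point with the word2vec tools listed above, the gensim port can train a skip-gram model in a few lines; the toy corpus and hyperparameters are illustrative, and the argument names follow gensim >= 4:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; in practice sentences would be streamed from disk.
sentences = [
    "the students went to class".split(),
    "fruit flies like a banana".split(),
    "plays well with others".split(),
]

# Skip-gram (sg=1) with illustrative hyperparameters (gensim >= 4 argument names).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("students", topn=3))
```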
  82. 82. • cuda-convnet2 (Alex Krizhevsky, Toronto) (C++/CUDA, optimized for GTX 580)
 https://code.google.com/p/cuda-convnet2/ • Caffe (Berkeley) (Cuda/OpenCL, Theano, Python)
 http://caffe.berkeleyvision.org/ • OverFeat (NYU) 
 http://cilvr.nyu.edu/doku.php?id=code:start Wanna Play ? Computer Vision 82
  83. 83. In touch! Academic/Research: as PhD candidate KTH/CSC: “Always interested in discussing Machine Learning, Deep Architectures, Graphs, and Language Technology” roelof@kth.se www.csc.kth.se/~roelof/ Internship / Entrepreneurship: as CIO/CTO Feeda: “Always looking for additions to our brand new R&D team”
 [Internships upcoming on the KTH exjobb website…] roelof@feeda.com www.feeda.com Feeda 83
  84. 84. We're Hiring! roelof@feeda.com www.feeda.com Feeda • Dev Ops • Software Developers • Data Scientists 84
