Lip reading Project

1. SECOND DEFENSE: VIDEO CAPTIONING AND LIP READING
2. WORK PROPOSED FOR MAJOR PROJECT
   Objectives
   • Train a neural network using LSTMs, RNNs and transfer learning for object detection (lip movement in this case) and link it with natural language processing.
   • Create a powerful tool capable of detecting objects and describing the events of a video.
   • If a human face and lip movement are detected, use AI techniques to read the lips and convert what is being said into text.
   Applications
   • Better search algorithms: if each video can be automatically described, search results will be finer and more accurate.
   • Recommendation systems: videos can easily be clustered by similarity if their contents can be automatically described.
   • Automated lip reading of speakers with damaged vocal tracts, biometric person identification, multi-talker simultaneous speech decoding, etc.
3. METHODOLOGY
   The project follows a three-step detection mechanism, and neural networks are used at every stage (a pipeline sketch follows this slide):
   1. The video is converted into image frames.
   2. Human lips are detected in the frames; if lip movement is found (YES branch), lip reading is performed.
   3. Otherwise (NO branch), the video contents are described and a caption is generated.
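A minimal sketch of this branching pipeline, assuming hypothetical callables lip_detector, lip_reader and captioner that stand in for the models introduced on the following slides:

    # Sketch of the three-step pipeline; the three callables are hypothetical
    # placeholders for the models described on slides 4-6.
    import cv2

    def process_video(path, lip_detector, lip_reader, captioner):
        # Step 1: convert the video into a sequence of image frames.
        frames = []
        cap = cv2.VideoCapture(path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(frame)
        cap.release()

        # Step 2: check whether moving human lips are present.
        if lip_detector(frames):
            return lip_reader(frames)   # lip-reading branch (YES)
        # Step 3: otherwise describe the video and generate a caption.
        return captioner(frames)        # captioning branch (NO)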
4. LIP MOVEMENT DETECTION
   • A simple RNN-based detector determines whether someone is speaking by watching their lip movements over 1 second of video (i.e. a sequence of 25 video frames). The detector can be run in real time on a video file, or on the output of a webcam, by using a sliding-window technique.
   • The model contains:
     • Two stacked RNN layers.
     • Each layer is composed of 64 non-bidirectional, simple RNN cells.
     • A dropout of 0.5 is applied to the output of the second RNN layer before it is fed to the final softmax classification layer.
   • Datasets that can be used: GRID, AMFED, DISFA, HMDB, Cohn-Kanade.
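A minimal Keras sketch of the detector described above, not the exact implementation; the per-frame feature size FEATURE_DIM is an assumption, since it depends on how the mouth region is encoded:

    from tensorflow.keras import layers, models

    SEQ_LEN = 25       # 1 second of video at 25 fps
    FEATURE_DIM = 40   # assumed size of the per-frame mouth-region feature vector

    model = models.Sequential([
        layers.Input(shape=(SEQ_LEN, FEATURE_DIM)),
        layers.SimpleRNN(64, return_sequences=True),  # first stacked RNN layer
        layers.SimpleRNN(64),                         # second stacked RNN layer
        layers.Dropout(0.5),                          # dropout on the second RNN output
        layers.Dense(2, activation="softmax"),        # speaking vs. not speaking
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])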
5. VIDEO CAPTIONING
   • Dataset that can be used: MSVD.
   • This dataset contains 1,450 short YouTube clips that have been manually labelled for training, and 100 videos for testing.
   • Each video has been assigned a unique ID, and each ID has about 15–20 captions.
   • Model used for feature extraction: VGG16 (chosen because it has fewer training parameters).
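A minimal sketch of per-frame feature extraction with VGG16, assuming the frames have already been resized to 224x224 RGB; the choice of the fc2 layer as the feature output is an assumption, not taken from the slides:

    import numpy as np
    from tensorflow.keras import Model
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.applications.vgg16 import preprocess_input

    vgg = VGG16(weights="imagenet", include_top=True)
    # fc2 yields a 4096-dim descriptor per frame, commonly fed to an LSTM caption decoder.
    feature_extractor = Model(inputs=vgg.input, outputs=vgg.get_layer("fc2").output)

    def extract_features(frames):
        """frames: array of shape (num_frames, 224, 224, 3), RGB, values 0-255."""
        x = preprocess_input(np.asarray(frames, dtype="float32"))
        return feature_extractor.predict(x)   # shape: (num_frames, 4096)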
6. LIP READING
   • Dataset used: GRID corpus.
   • GRID is a large multi-talker audio-visual sentence corpus to support joint studies in speech perception. In brief, the corpus consists of high-quality audio and video (facial) recordings of 1,000 sentences spoken by each of 34 talkers (18 male, 16 female). Sentences are of the form "put red at G9 now".
   • LipNet architecture: a sequence of T frames is used as input and is processed by 3 layers of STCNN, each followed by a spatial max-pooling layer. The extracted features are processed by 2 Bi-GRUs; each time-step of the GRU output is processed by a linear layer and a softmax. This end-to-end model is trained with CTC.
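A rough Keras sketch of a LipNet-style network following the description above (3 STCNN blocks with spatial max-pooling, 2 Bi-GRUs, a per-time-step linear layer with softmax, trained with CTC). The input size, filter counts, GRU width and vocabulary size are assumptions, not the published hyperparameters:

    from tensorflow.keras import layers, models

    T, H, W, C = 75, 50, 100, 3   # assumed: 75 mouth-region frames of 50x100 RGB
    VOCAB = 28                    # assumed: 26 letters + space + CTC blank

    inp = layers.Input(shape=(T, H, W, C))
    x = inp
    for filters in (32, 64, 96):                                   # 3 STCNN blocks
        x = layers.Conv3D(filters, (3, 5, 5), padding="same", activation="relu")(x)
        x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)               # spatial pooling only
    x = layers.TimeDistributed(layers.Flatten())(x)                # (T, features)
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
    out = layers.Dense(VOCAB, activation="softmax")(x)             # per-time-step chars

    lipnet = models.Model(inp, out)
    # Training uses the CTC loss (e.g. tf.nn.ctc_loss / keras.backend.ctc_batch_cost)
    # so that frame-level outputs can be aligned with unsegmented character sequences.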
7. INPUT VIDEO
8. LIP READING APPLICATION
9. INPUT VIDEO
10. VIDEO CAPTIONING APPLICATION
11. THANK YOU. YASHIKA CHUGH (40214803118)
