SlideShare a Scribd company logo
1 of 46
Download to read offline
Video + Language
Jiebo Luo
Department of Computer Science
Introduction
• Video has become ubiquitous on the Internet, TV, as well as
personal devices.
• Recognition of video content has been a fundamental challenge
in computer vision for decades, where previous research
predominantly focused on understanding videos using a
predefined yet limited vocabulary.
• Thanks to the recent development of deep learning techniques,
researchers in both computer vision and multimedia
communities are now striving to bridge video with
natural language, which can be regarded as the ultimate goal of
video understanding.
• We present recent advances in exploring the synergy of video
understanding and language processing, including video-
language alignment, video captioning, and video emotion
analysis.
From Classification to Description
Recognizing Realistic Actions from Videos "in the Wild"
UCF-11 to UCF-101
(CVPR 2009)
Similarity btw Videos Cross-Domain Learning
Visual Event Recognition in Videos by
Learning from Web Data
(CVPR2010 Best Student Paper)
Heterogeneous Feature Machine For Visual Recognition
(ICCV 2009)
From Classification to Description
4
Semantic Video Entity Linking
(ICCV2015)
Exploring Coherent Motion Patterns
via Structured Trajectory Learning for
Crowd Mood Modeling
(IEEE T-CSVT 2016)
From Classification to Description
Aligning Language Descriptions
with Videos
Iftekhar Naim, Young Chol Song, Qiguang Liu
Jiebo Luo, Dan Gildea, Henry Kautz
link
Unsupervised Alignment of Actions
in Video with Text Descriptions
Y. Song, I. Naim, A. Mamun, K. Kulkarni, P. Singla
J. Luo, D. Gildea, H. Kautz
OverviewOverview
 Unsupervised alignment of video with text
 Motivations
 Generate labels from data (reduce burden of manual labeling)
 Learn new actions from only parallel video+text
 Extend noun/object matching to verbs and actions
 Unsupervised alignment of video with text
 Motivations
 Generate labels from data (reduce burden of manual labeling)
 Learn new actions from only parallel video+text
 Extend noun/object matching to verbs and actions
Matching Verbs to Actions
The person takes out a knife
and cutting board
Matching Nouns to Objects
[Naim et al., 2015]
An overview of the text and
video alignment framework
Hyperfeatures for ActionsHyperfeatures for Actions
 High-level features required for alignment with text
→ Motion features are generally low-level
 Hyperfeatures, originally used for image recognition extended
for use with motion features
→ Use temporal domain instead of spatial domain for vector
quantization (clustering)
 High-level features required for alignment with text
→ Motion features are generally low-level
 Hyperfeatures, originally used for image recognition extended
for use with motion features
→ Use temporal domain instead of spatial domain for vector
quantization (clustering)
Originally described in “Hyperfeatures:
Multilevel Local Coding for Visual Recog‐
nition” Agarwal, A. (ECCV 06), for images Hyperfeatures for actions
Hyperfeatures for ActionsHyperfeatures for Actions
 From low-level motion features, create high-level
representations that can easily align with verbs in text
 From low-level motion features, create high-level
representations that can easily align with verbs in text
Cluster 3
at time t
Accumulate over
frame at time t
& cluster
Conduct vector
quantization
of the histogram
at time t
Cluster 3, 5, …,5,20
= Hyperfeature 6
Each color code
is a vector
quantized
STIP point
Vector quantized
STIP point histogram at time t
Accumulate clusters over
window (t‐w/2, t+w/2]
and conduct vector
quantization
→ first‐level hyperfeatures
Align hyperfeatures
with verbs from text
(using LCRF)
Latent-variable CRF AlignmentLatent-variable CRF Alignment
 CRF where the latent variable is the alignment
 N pairs of video/text observations {(xi, yi)}i=1 (indexed by i)
 Xi,m represents nouns and verbs extracted from the mth sentence
 Yi,n represents blobs and actions in interval n in the video
 Conditional likelihood
 conditional probability of
 Learning weights w
 Stochastic gradient descent
 CRF where the latent variable is the alignment
 N pairs of video/text observations {(xi, yi)}i=1 (indexed by i)
 Xi,m represents nouns and verbs extracted from the mth sentence
 Yi,n represents blobs and actions in interval n in the video
 Conditional likelihood
 conditional probability of
 Learning weights w
 Stochastic gradient descent
where feature function
More details in Naim et al. 2015 NAACL Paper ‐
Discriminative unsupervised alignment of natural language instructions with corresponding video segments
N
Experiments: Wetlab DatasetExperiments: Wetlab Dataset
 RGB-Depth video with lab protocols in text
 Compare addition of hyperfeatures generated from motion
features to previous results (Naim et al. 2015)
 Small improvement over previous results
 Activities already highly correlated with object-use
 RGB-Depth video with lab protocols in text
 Compare addition of hyperfeatures generated from motion
features to previous results (Naim et al. 2015)
 Small improvement over previous results
 Activities already highly correlated with object-use
Detection of objects in 3D space
using color and point‐cloud
Previous results
using object/noun
alignment only
Addition of different types
of motion features
2DTraj: Dense trajectories
*Using hyperfeature window size w=150
Experiments: TACoS DatasetExperiments: TACoS Dataset
 RGB video with crowd-sourced text descriptions
 Activities such as “making a salad,” “baking a cake”
 No object recognition, alignment using actions only
 Uniform: Assume each sentence takes the same amount of time over the entire sequence
 Segmented LCRF: Assume the segmentation of actions is known, infer only the action labels
 Unsupervised LCRF: Both segmentation and alignment are unknown
 Effect of window size and number of clusters
 Consistent with average
action length: 150 frames
 RGB video with crowd-sourced text descriptions
 Activities such as “making a salad,” “baking a cake”
 No object recognition, alignment using actions only
 Uniform: Assume each sentence takes the same amount of time over the entire sequence
 Segmented LCRF: Assume the segmentation of actions is known, infer only the action labels
 Unsupervised LCRF: Both segmentation and alignment are unknown
 Effect of window size and number of clusters
 Consistent with average
action length: 150 frames
*Using hyperfeature
window size w=150
*d(2)=64
Experiments: TACoS DatasetExperiments: TACoS Dataset
 Segmentation from a sequence in the dataset Segmentation from a sequence in the dataset
Crowd‐sourced descriptions
Example of text and video alignment generated
by the system on the TACoS corpus for sequence s13‐d28
Image Captioning with Semantic
Attention (CVPR 2016)
Quanzeng You, Jiebo Luo
Hailin Jin, Zhaowen Wang and Chen Fang
Image Captioning
• Motivations
– Real-world Usability
• Help visually impaired people, learning-impaired
– Improving Image Understanding
• Classification, Objection detection
– Image Retrieval
1. a young girl inhales with the intent of blowing
out a candle
2. girl blowing out the candle on an ice cream
1. A shot from behind home
plate of children playing
baseball
2. A group of children playing
baseball in the rain
3. Group of baseball players
playing on a wet field
Introduction of Image Captioning
• Machine learning as an approach to solve the problem
Overview
• Brief overview of current approaches
• Our main motivation
• The proposed semantic attention model
• Evaluation results
Brief Introduction of Recurrent Neural Network
• Different from CNN
11),(   ttttt BhAxhxfh
tt Chy 
• Unfolding over time
 Feedforward network
 Backpropagation through time
Inputs
Hidden Units
Outputs
xt ht-1
ht
yt
Inputs
Hidden Units
Outputs
C
yt
Inputs
Hidden Units
Inputs
Hidden Units
B
B
A
A
A B
t-1
t-2
Applications of Recurrent Neural Networks
• Machine Translation
• Reads input sentence “ABC” and produces “WXYZ”
Decoder RNNEncoder RNN
Encoder-Decoder Framework for Captioning
• Inspired by neural network based machine
translation
• Loss function



N
t
tt wwIwp
IwpL
1
10 ),,,|(log
)|(log

Our Motivation
• Additional textual information
– Own noisy title, tags or captions (Web)
Our Motivation
• Additional textual information
– Own noisy title, tags or captions (Web)
– Visually similar nearest neighbor images
Our Motivation
• Additional textual information
– Own noisy title, tags or captions (Web)
– Visually similar nearest neighbor images
– Success of low-level tasks
• Visual attributes detection
Image Captioning with Semantic Attention
• Big idea
First Idea
• Provide additional knowledge at each input node
• Concatenate the input word and the extra attributes K
• Each image has a fixed keyword list
)],,([),( 11   tktttt hbKWwfhxfh
Visual Features: 1024
GoogleNet
LSTM Hidden states: 512
Training details:
1. 256 image/sentence
pairs
2. RMS-Prob
Using Attributes along with Visual Features
• Provide additional knowledge at each input node
• Concatenate the visual embedding and keywords for h0
];[),( 10 bKWvWhvfh kiv  
Attention Model on Attributes
• Instead of using the same set of attributes at every
step
• At each step, select the attributes (attention)
 m mtmt kKwatt ),(
)softmax VK(wT
tt 
))],,(;([),( 11   tttttt hKwattxfhxfh
Overall Framework
• Training with a bilinear/bilateral attention model
ht
pt
xt
v
{Ai}
Yt~
RNN
Image
CNN
AttrDet 1
AttrDet 2
AttrDet 3
AttrDet N

t = 0
Word
Visual Attributes
• A secondary contribution
• We try different approaches
Performance
• Examples showing the impact of visual attributes on captions
Performance on the Testing Dataset
• Publicly available split
Performance
• MS-COCO Image Captioning Challenge
TGIF: A New Dataset and Benchmark on
Animated GIF Description
Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault,
Larry Goldberg, Jiebo Luo
Overview
Comparison with Existing Datasets
Examples
a skate boarder is
doing trick on his skate
board.
a gloved hand opens to
reveal a golden ring.
a sport car is swinging on
the race playground
the vehicle is moving fast
into the tunnel
Contributions
• A large scale animated GIF description dataset for promoting image
sequence modeling and research
• Performing automatic validation to collect natural language descriptions
from crowd workers
• Establishing baseline image captioning methods for future
benchmarking
• Comparison with existing datasets, highlighting the benefits with
animated GIFs
In Comparison with Existing Datasets
• The language in our dataset is closer to common language
• Our dataset has an emphasis on the verbs
• Animated GIFs are more coherent and self contained
• Our dataset can be used to solve more difficult movie description
problem
Machine Generated Sentence Examples
Machine Generated Sentence Examples
Machine Generated Sentence Examples
Comparing Professionals and Crowd-workers
Crowd worker: two people are kissing on a boat.
Professional: someone glances at a kissing couple
then steps to a railing overlooking the ocean an
older man and woman stand beside him.
Crowd worker: two men got into their car
and not able to go anywhere because the
wheels were locked.
Professional: someone slides over the
camaros hood then gets in with his partner
he starts the engine the revving vintage
car starts to backup then lurches to a halt.
Crowd worker: a man in a shirt and tie sits beside a person who is
covered in a sheet.
Professional: he makes eye contact with the woman for only a second.
More: http://beta-
web2.cloudapp.net/ls
mdc_sentence_comp
arison.html
Movie Descriptions versus TGIF
• Crowd workers are encouraged to describe the major visual content
directly, and not to use overly descriptive language
• Because our animated GIFs are presented to crowd workers without
any context, the sentences in our dataset are more self-contained
• Animated GIFs are perfectly segmented since they are carefully
curated by online users to create a coherent visual story
Code & Dataset
• Yahoo! webscope (Coming soon!)
• Animated GIFs and sentences
• Code and models for LSTM baseline
• Pipeline for syntactic and semantic validation to collect natural
languages from crowd workers
Thanks
Q & A
Visual Intelligence & Social Multimedia Analytics
Google
***
Baidu
Sogou
Bing
XiaoIce
How Intelligent Are the AI Systems Today?

More Related Content

What's hot

Deep Learning: a birds eye view
Deep Learning: a birds eye viewDeep Learning: a birds eye view
Deep Learning: a birds eye viewRoelof Pieters
 
Intro To Convolutional Neural Networks
Intro To Convolutional Neural NetworksIntro To Convolutional Neural Networks
Intro To Convolutional Neural NetworksMark Scully
 
Zero shot learning through cross-modal transfer
Zero shot learning through cross-modal transferZero shot learning through cross-modal transfer
Zero shot learning through cross-modal transferRoelof Pieters
 
Deep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural ZooDeep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural ZooChristian Perone
 
IRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound RecognitionIRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound RecognitionIRJET Journal
 
MDEC Data Matters Series: machine learning and Deep Learning, A Primer
MDEC Data Matters Series: machine learning and Deep Learning, A PrimerMDEC Data Matters Series: machine learning and Deep Learning, A Primer
MDEC Data Matters Series: machine learning and Deep Learning, A PrimerPoo Kuan Hoong
 
Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)Ha Phuong
 
Deep Learning and Reinforcement Learning
Deep Learning and Reinforcement LearningDeep Learning and Reinforcement Learning
Deep Learning and Reinforcement LearningRenārs Liepiņš
 
Deep convnets for global recognition (Master in Computer Vision Barcelona 2016)
Deep convnets for global recognition (Master in Computer Vision Barcelona 2016)Deep convnets for global recognition (Master in Computer Vision Barcelona 2016)
Deep convnets for global recognition (Master in Computer Vision Barcelona 2016)Universitat Politècnica de Catalunya
 
Details of Lazy Deep Learning for Images Recognition in ZZ Photo app
Details of Lazy Deep Learning for Images Recognition in ZZ Photo appDetails of Lazy Deep Learning for Images Recognition in ZZ Photo app
Details of Lazy Deep Learning for Images Recognition in ZZ Photo appPAY2 YOU
 
Deep Learning Models for Question Answering
Deep Learning Models for Question AnsweringDeep Learning Models for Question Answering
Deep Learning Models for Question AnsweringSujit Pal
 
Deep Learning - A Literature survey
Deep Learning - A Literature surveyDeep Learning - A Literature survey
Deep Learning - A Literature surveyAkshay Hegde
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Roelof Pieters
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsRoelof Pieters
 
An Introduction to Deep Learning
An Introduction to Deep LearningAn Introduction to Deep Learning
An Introduction to Deep LearningPoo Kuan Hoong
 
101: Convolutional Neural Networks
101: Convolutional Neural Networks 101: Convolutional Neural Networks
101: Convolutional Neural Networks Mad Scientists
 

What's hot (19)

Deep Learning: a birds eye view
Deep Learning: a birds eye viewDeep Learning: a birds eye view
Deep Learning: a birds eye view
 
Intro To Convolutional Neural Networks
Intro To Convolutional Neural NetworksIntro To Convolutional Neural Networks
Intro To Convolutional Neural Networks
 
Zero shot learning through cross-modal transfer
Zero shot learning through cross-modal transferZero shot learning through cross-modal transfer
Zero shot learning through cross-modal transfer
 
Deep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural ZooDeep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural Zoo
 
IRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound RecognitionIRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound Recognition
 
Tutorial on Deep Learning
Tutorial on Deep LearningTutorial on Deep Learning
Tutorial on Deep Learning
 
MDEC Data Matters Series: machine learning and Deep Learning, A Primer
MDEC Data Matters Series: machine learning and Deep Learning, A PrimerMDEC Data Matters Series: machine learning and Deep Learning, A Primer
MDEC Data Matters Series: machine learning and Deep Learning, A Primer
 
Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)
 
Deep Learning and Reinforcement Learning
Deep Learning and Reinforcement LearningDeep Learning and Reinforcement Learning
Deep Learning and Reinforcement Learning
 
Deep convnets for global recognition (Master in Computer Vision Barcelona 2016)
Deep convnets for global recognition (Master in Computer Vision Barcelona 2016)Deep convnets for global recognition (Master in Computer Vision Barcelona 2016)
Deep convnets for global recognition (Master in Computer Vision Barcelona 2016)
 
Details of Lazy Deep Learning for Images Recognition in ZZ Photo app
Details of Lazy Deep Learning for Images Recognition in ZZ Photo appDetails of Lazy Deep Learning for Images Recognition in ZZ Photo app
Details of Lazy Deep Learning for Images Recognition in ZZ Photo app
 
Deep Learning Models for Question Answering
Deep Learning Models for Question AnsweringDeep Learning Models for Question Answering
Deep Learning Models for Question Answering
 
Deep Learning - A Literature survey
Deep Learning - A Literature surveyDeep Learning - A Literature survey
Deep Learning - A Literature survey
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!
 
Deep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word EmbeddingsDeep Learning for NLP: An Introduction to Neural Word Embeddings
Deep Learning for NLP: An Introduction to Neural Word Embeddings
 
An Introduction to Deep Learning
An Introduction to Deep LearningAn Introduction to Deep Learning
An Introduction to Deep Learning
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
 
Deep learning - a primer
Deep learning - a primerDeep learning - a primer
Deep learning - a primer
 
101: Convolutional Neural Networks
101: Convolutional Neural Networks 101: Convolutional Neural Networks
101: Convolutional Neural Networks
 

Similar to Video+Language: From Classification to Description

Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersNear-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersSymeon Papadopoulos
 
Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...
Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...
Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...AI Frontiers
 
Compact and Distinctive Visual Vocabularies for Efficient Multimedia Data Ind...
Compact and Distinctive Visual Vocabularies for Efficient Multimedia Data Ind...Compact and Distinctive Visual Vocabularies for Efficient Multimedia Data Ind...
Compact and Distinctive Visual Vocabularies for Efficient Multimedia Data Ind...Symeon Papadopoulos
 
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosAdria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosCodiax
 
How to use transfer learning to bootstrap image classification and question a...
How to use transfer learning to bootstrap image classification and question a...How to use transfer learning to bootstrap image classification and question a...
How to use transfer learning to bootstrap image classification and question a...Wee Hyong Tok
 
On the Influence Propagation of Web Videos
On the Influence Propagation of Web VideosOn the Influence Propagation of Web Videos
On the Influence Propagation of Web Videosabidhavp
 
Deep learning on mobile
Deep learning on mobileDeep learning on mobile
Deep learning on mobileAnirudh Koul
 
Action_recognition-topic.pptx
Action_recognition-topic.pptxAction_recognition-topic.pptx
Action_recognition-topic.pptxcomputerscience98
 
Deep neural networks for Youtube recommendations
Deep neural networks for Youtube recommendationsDeep neural networks for Youtube recommendations
Deep neural networks for Youtube recommendationsAryan Khandal
 
Tagging based Efficient Web Video Event Categorization
Tagging based Efficient Web Video Event CategorizationTagging based Efficient Web Video Event Categorization
Tagging based Efficient Web Video Event CategorizationEditor IJCATR
 
MediaEval 2017 - Interestingness Task: Multimodality and Deep Learning when p...
MediaEval 2017 - Interestingness Task: Multimodality and Deep Learning when p...MediaEval 2017 - Interestingness Task: Multimodality and Deep Learning when p...
MediaEval 2017 - Interestingness Task: Multimodality and Deep Learning when p...multimediaeval
 
DSC NTUE Info Session
DSC NTUE Info SessionDSC NTUE Info Session
DSC NTUE Info Sessionssusera8eac9
 
Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Sudeep Das, Ph.D.
 
Content Modelling for Human Action Detection via Multidimensional Approach
Content Modelling for Human Action Detection via Multidimensional ApproachContent Modelling for Human Action Detection via Multidimensional Approach
Content Modelling for Human Action Detection via Multidimensional ApproachCSCJournals
 
Quoc le, slides MLconf 11/15/13
Quoc le, slides  MLconf 11/15/13Quoc le, slides  MLconf 11/15/13
Quoc le, slides MLconf 11/15/13MLconf
 

Similar to Video+Language: From Classification to Description (20)

Video + Language 2019
Video + Language 2019Video + Language 2019
Video + Language 2019
 
Video + Language: Where Does Domain Knowledge Fit in?
Video + Language: Where Does Domain Knowledge Fit in?Video + Language: Where Does Domain Knowledge Fit in?
Video + Language: Where Does Domain Knowledge Fit in?
 
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersNear-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
 
Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...
Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...
Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube an...
 
Compact and Distinctive Visual Vocabularies for Efficient Multimedia Data Ind...
Compact and Distinctive Visual Vocabularies for Efficient Multimedia Data Ind...Compact and Distinctive Visual Vocabularies for Efficient Multimedia Data Ind...
Compact and Distinctive Visual Vocabularies for Efficient Multimedia Data Ind...
 
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosAdria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
 
How to use transfer learning to bootstrap image classification and question a...
How to use transfer learning to bootstrap image classification and question a...How to use transfer learning to bootstrap image classification and question a...
How to use transfer learning to bootstrap image classification and question a...
 
On the Influence Propagation of Web Videos
On the Influence Propagation of Web VideosOn the Influence Propagation of Web Videos
On the Influence Propagation of Web Videos
 
Video google-text (2)
Video google-text (2)Video google-text (2)
Video google-text (2)
 
Deep learning on mobile
Deep learning on mobileDeep learning on mobile
Deep learning on mobile
 
Action_recognition-topic.pptx
Action_recognition-topic.pptxAction_recognition-topic.pptx
Action_recognition-topic.pptx
 
Deep neural networks for Youtube recommendations
Deep neural networks for Youtube recommendationsDeep neural networks for Youtube recommendations
Deep neural networks for Youtube recommendations
 
Tagging based Efficient Web Video Event Categorization
Tagging based Efficient Web Video Event CategorizationTagging based Efficient Web Video Event Categorization
Tagging based Efficient Web Video Event Categorization
 
MediaEval 2017 - Interestingness Task: Multimodality and Deep Learning when p...
MediaEval 2017 - Interestingness Task: Multimodality and Deep Learning when p...MediaEval 2017 - Interestingness Task: Multimodality and Deep Learning when p...
MediaEval 2017 - Interestingness Task: Multimodality and Deep Learning when p...
 
DSC NTUE Info Session
DSC NTUE Info SessionDSC NTUE Info Session
DSC NTUE Info Session
 
Ai use cases
Ai use casesAi use cases
Ai use cases
 
Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it! Crafting Recommenders: the Shallow and the Deep of it!
Crafting Recommenders: the Shallow and the Deep of it!
 
Content Modelling for Human Action Detection via Multidimensional Approach
Content Modelling for Human Action Detection via Multidimensional ApproachContent Modelling for Human Action Detection via Multidimensional Approach
Content Modelling for Human Action Detection via Multidimensional Approach
 
Quoc le, slides MLconf 11/15/13
Quoc le, slides  MLconf 11/15/13Quoc le, slides  MLconf 11/15/13
Quoc le, slides MLconf 11/15/13
 
Deep Neural Networks (DNN)
Deep Neural Networks (DNN)Deep Neural Networks (DNN)
Deep Neural Networks (DNN)
 

More from Goergen Institute for Data Science

More from Goergen Institute for Data Science (10)

Vision and Language: Past, Present and Future
Vision and Language: Past, Present and FutureVision and Language: Past, Present and Future
Vision and Language: Past, Present and Future
 
Learning with Unpaired Data
Learning with Unpaired DataLearning with Unpaired Data
Learning with Unpaired Data
 
How to properly review AI papers?
How to properly review AI papers?How to properly review AI papers?
How to properly review AI papers?
 
When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...
When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...
When E-commerce Meets Social Media: Identifying Business on WeChat Moment Usi...
 
Computer Vision++: Where Do We Go from Here?
Computer Vision++: Where Do We Go from Here?Computer Vision++: Where Do We Go from Here?
Computer Vision++: Where Do We Go from Here?
 
A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...
A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...
A Selfie is Worth a Thousand Words: Mining Personal Patterns behind User Self...
 
When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...
When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...
When Fashion Meets Big Data: Discriminative Mining of Best Selling Clothing F...
 
Big Data Better Life
Big Data Better LifeBig Data Better Life
Big Data Better Life
 
Social Multimedia as Sensors
Social Multimedia as SensorsSocial Multimedia as Sensors
Social Multimedia as Sensors
 
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
Forever Young: A Tribute to the Grandmaster through a recount of Personal Jou...
 

Recently uploaded

VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 

Recently uploaded (20)

VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 

Video+Language: From Classification to Description

  • 1. Video + Language Jiebo Luo Department of Computer Science
  • 2. Introduction • Video has become ubiquitous on the Internet, TV, as well as personal devices. • Recognition of video content has been a fundamental challenge in computer vision for decades, where previous research predominantly focused on understanding videos using a predefined yet limited vocabulary. • Thanks to the recent development of deep learning techniques, researchers in both computer vision and multimedia communities are now striving to bridge video with natural language, which can be regarded as the ultimate goal of video understanding. • We present recent advances in exploring the synergy of video understanding and language processing, including video- language alignment, video captioning, and video emotion analysis.
  • 3. From Classification to Description Recognizing Realistic Actions from Videos "in the Wild" UCF-11 to UCF-101 (CVPR 2009) Similarity btw Videos Cross-Domain Learning Visual Event Recognition in Videos by Learning from Web Data (CVPR2010 Best Student Paper) Heterogeneous Feature Machine For Visual Recognition (ICCV 2009)
  • 4. From Classification to Description 4 Semantic Video Entity Linking (ICCV2015)
  • 5. Exploring Coherent Motion Patterns via Structured Trajectory Learning for Crowd Mood Modeling (IEEE T-CSVT 2016) From Classification to Description
  • 6. Aligning Language Descriptions with Videos Iftekhar Naim, Young Chol Song, Qiguang Liu Jiebo Luo, Dan Gildea, Henry Kautz link
  • 7. Unsupervised Alignment of Actions in Video with Text Descriptions Y. Song, I. Naim, A. Mamun, K. Kulkarni, P. Singla J. Luo, D. Gildea, H. Kautz
  • 8. OverviewOverview  Unsupervised alignment of video with text  Motivations  Generate labels from data (reduce burden of manual labeling)  Learn new actions from only parallel video+text  Extend noun/object matching to verbs and actions  Unsupervised alignment of video with text  Motivations  Generate labels from data (reduce burden of manual labeling)  Learn new actions from only parallel video+text  Extend noun/object matching to verbs and actions Matching Verbs to Actions The person takes out a knife and cutting board Matching Nouns to Objects [Naim et al., 2015] An overview of the text and video alignment framework
  • 9. Hyperfeatures for ActionsHyperfeatures for Actions  High-level features required for alignment with text → Motion features are generally low-level  Hyperfeatures, originally used for image recognition extended for use with motion features → Use temporal domain instead of spatial domain for vector quantization (clustering)  High-level features required for alignment with text → Motion features are generally low-level  Hyperfeatures, originally used for image recognition extended for use with motion features → Use temporal domain instead of spatial domain for vector quantization (clustering) Originally described in “Hyperfeatures: Multilevel Local Coding for Visual Recog‐ nition” Agarwal, A. (ECCV 06), for images Hyperfeatures for actions
  • 10. Hyperfeatures for ActionsHyperfeatures for Actions  From low-level motion features, create high-level representations that can easily align with verbs in text  From low-level motion features, create high-level representations that can easily align with verbs in text Cluster 3 at time t Accumulate over frame at time t & cluster Conduct vector quantization of the histogram at time t Cluster 3, 5, …,5,20 = Hyperfeature 6 Each color code is a vector quantized STIP point Vector quantized STIP point histogram at time t Accumulate clusters over window (t‐w/2, t+w/2] and conduct vector quantization → first‐level hyperfeatures Align hyperfeatures with verbs from text (using LCRF)
  • 11. Latent-variable CRF AlignmentLatent-variable CRF Alignment  CRF where the latent variable is the alignment  N pairs of video/text observations {(xi, yi)}i=1 (indexed by i)  Xi,m represents nouns and verbs extracted from the mth sentence  Yi,n represents blobs and actions in interval n in the video  Conditional likelihood  conditional probability of  Learning weights w  Stochastic gradient descent  CRF where the latent variable is the alignment  N pairs of video/text observations {(xi, yi)}i=1 (indexed by i)  Xi,m represents nouns and verbs extracted from the mth sentence  Yi,n represents blobs and actions in interval n in the video  Conditional likelihood  conditional probability of  Learning weights w  Stochastic gradient descent where feature function More details in Naim et al. 2015 NAACL Paper ‐ Discriminative unsupervised alignment of natural language instructions with corresponding video segments N
  • 12. Experiments: Wetlab DatasetExperiments: Wetlab Dataset  RGB-Depth video with lab protocols in text  Compare addition of hyperfeatures generated from motion features to previous results (Naim et al. 2015)  Small improvement over previous results  Activities already highly correlated with object-use  RGB-Depth video with lab protocols in text  Compare addition of hyperfeatures generated from motion features to previous results (Naim et al. 2015)  Small improvement over previous results  Activities already highly correlated with object-use Detection of objects in 3D space using color and point‐cloud Previous results using object/noun alignment only Addition of different types of motion features 2DTraj: Dense trajectories *Using hyperfeature window size w=150
  • 13. Experiments: TACoS DatasetExperiments: TACoS Dataset  RGB video with crowd-sourced text descriptions  Activities such as “making a salad,” “baking a cake”  No object recognition, alignment using actions only  Uniform: Assume each sentence takes the same amount of time over the entire sequence  Segmented LCRF: Assume the segmentation of actions is known, infer only the action labels  Unsupervised LCRF: Both segmentation and alignment are unknown  Effect of window size and number of clusters  Consistent with average action length: 150 frames  RGB video with crowd-sourced text descriptions  Activities such as “making a salad,” “baking a cake”  No object recognition, alignment using actions only  Uniform: Assume each sentence takes the same amount of time over the entire sequence  Segmented LCRF: Assume the segmentation of actions is known, infer only the action labels  Unsupervised LCRF: Both segmentation and alignment are unknown  Effect of window size and number of clusters  Consistent with average action length: 150 frames *Using hyperfeature window size w=150 *d(2)=64
  • 14. Experiments: TACoS DatasetExperiments: TACoS Dataset  Segmentation from a sequence in the dataset Segmentation from a sequence in the dataset Crowd‐sourced descriptions Example of text and video alignment generated by the system on the TACoS corpus for sequence s13‐d28
  • 15. Image Captioning with Semantic Attention (CVPR 2016) Quanzeng You, Jiebo Luo Hailin Jin, Zhaowen Wang and Chen Fang
  • 16. Image Captioning • Motivations – Real-world Usability • Help visually impaired people, learning-impaired – Improving Image Understanding • Classification, Objection detection – Image Retrieval 1. a young girl inhales with the intent of blowing out a candle 2. girl blowing out the candle on an ice cream 1. A shot from behind home plate of children playing baseball 2. A group of children playing baseball in the rain 3. Group of baseball players playing on a wet field
  • 17. Introduction of Image Captioning • Machine learning as an approach to solve the problem
  • 18. Overview • Brief overview of current approaches • Our main motivation • The proposed semantic attention model • Evaluation results
  • 19. Brief Introduction of Recurrent Neural Network • Different from CNN 11),(   ttttt BhAxhxfh tt Chy  • Unfolding over time  Feedforward network  Backpropagation through time Inputs Hidden Units Outputs xt ht-1 ht yt Inputs Hidden Units Outputs C yt Inputs Hidden Units Inputs Hidden Units B B A A A B t-1 t-2
  • 20. Applications of Recurrent Neural Networks • Machine Translation • Reads input sentence “ABC” and produces “WXYZ” Decoder RNNEncoder RNN
  • 21. Encoder-Decoder Framework for Captioning • Inspired by neural network based machine translation • Loss function    N t tt wwIwp IwpL 1 10 ),,,|(log )|(log 
  • 22. Our Motivation • Additional textual information – Own noisy title, tags or captions (Web)
  • 23. Our Motivation • Additional textual information – Own noisy title, tags or captions (Web) – Visually similar nearest neighbor images
  • 24. Our Motivation • Additional textual information – Own noisy title, tags or captions (Web) – Visually similar nearest neighbor images – Success of low-level tasks • Visual attributes detection
  • 25. Image Captioning with Semantic Attention • Big idea
  • 26. First Idea • Provide additional knowledge at each input node • Concatenate the input word and the extra attributes K • Each image has a fixed keyword list )],,([),( 11   tktttt hbKWwfhxfh Visual Features: 1024 GoogleNet LSTM Hidden states: 512 Training details: 1. 256 image/sentence pairs 2. RMS-Prob
  • 27. Using Attributes along with Visual Features • Provide additional knowledge at each input node • Concatenate the visual embedding and keywords for h0 ];[),( 10 bKWvWhvfh kiv  
  • 28. Attention Model on Attributes • Instead of using the same set of attributes at every step • At each step, select the attributes (attention)  m mtmt kKwatt ),( )softmax VK(wT tt  ))],,(;([),( 11   tttttt hKwattxfhxfh
  • 29. Overall Framework • Training with a bilinear/bilateral attention model ht pt xt v {Ai} Yt~ RNN Image CNN AttrDet 1 AttrDet 2 AttrDet 3 AttrDet N  t = 0 Word
  • 30. Visual Attributes • A secondary contribution • We try different approaches
  • 31. Performance • Examples showing the impact of visual attributes on captions
  • 32. Performance on the Testing Dataset • Publicly available split
  • 33. Performance • MS-COCO Image Captioning Challenge
  • 34. TGIF: A New Dataset and Benchmark on Animated GIF Description Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Jiebo Luo
  • 37. Examples a skate boarder is doing trick on his skate board. a gloved hand opens to reveal a golden ring. a sport car is swinging on the race playground the vehicle is moving fast into the tunnel
  • 38. Contributions • A large scale animated GIF description dataset for promoting image sequence modeling and research • Performing automatic validation to collect natural language descriptions from crowd workers • Establishing baseline image captioning methods for future benchmarking • Comparison with existing datasets, highlighting the benefits with animated GIFs
  • 39. In Comparison with Existing Datasets • The language in our dataset is closer to common language • Our dataset has an emphasis on the verbs • Animated GIFs are more coherent and self contained • Our dataset can be used to solve more difficult movie description problem
  • 43. Comparing Professionals and Crowd-workers Crowd worker: two people are kissing on a boat. Professional: someone glances at a kissing couple then steps to a railing overlooking the ocean an older man and woman stand beside him. Crowd worker: two men got into their car and not able to go anywhere because the wheels were locked. Professional: someone slides over the camaros hood then gets in with his partner he starts the engine the revving vintage car starts to backup then lurches to a halt. Crowd worker: a man in a shirt and tie sits beside a person who is covered in a sheet. Professional: he makes eye contact with the woman for only a second. More: http://beta- web2.cloudapp.net/ls mdc_sentence_comp arison.html
  • 44. Movie Descriptions versus TGIF • Crowd workers are encouraged to describe the major visual content directly, and not to use overly descriptive language • Because our animated GIFs are presented to crowd workers without any context, the sentences in our dataset are more self-contained • Animated GIFs are perfectly segmented since they are carefully curated by online users to create a coherent visual story
  • 45. Code & Dataset • Yahoo! webscope (Coming soon!) • Animated GIFs and sentences • Code and models for LSTM baseline • Pipeline for syntactic and semantic validation to collect natural languages from crowd workers
  • 46. Thanks Q & A Visual Intelligence & Social Multimedia Analytics Google *** Baidu Sogou Bing XiaoIce How Intelligent Are the AI Systems Today?