Action Recognition: A General Survey of Previous Works. Sobhan Naderi Parizi, September 2009
List of papers:
Statistical Analysis of Dynamic Actions
On Space-Time Interest Points
Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words
What, where and who? Classifying events by scene and object recognition
Recognizing Actions at a Distance
Recognizing Human Actions: A Local SVM Approach
Retrieving Actions in Movies
Learning Realistic Human Actions from Movies
Actions in Context
Selection and Context for Action Recognition
Non-parametric Distance Measure for Action Recognition Paper info: Title: Statistical Analysis of Dynamic Actions Authors: Lihi Zelnik-Manor, Michal Irani. TPAMI 2006. A preliminary version appeared in CVPR 2001 as “Event-Based Analysis of Video”
“Statistical Analysis of Dynamic Actions” Overview: Introduces a non-parametric distance measure Video matching (no action model): given a reference video, similar sequences are found Dense features from multiple temporal scales (only corresponding scales are compared) Temporal extent of videos in each category should be the same! (fast and slow dancing are different actions) A new database is introduced: Periodic activities (walk) Non-periodic activities (Punch, Kick, Duck, Tennis) Temporal Textures (water) www.wisdom.weizmann.ac.il/~vision/EventDetection.html
“Statistical Analysis of Dynamic Actions” Feature description: Space-time gradient of each pixel Threshold the gradient magnitudes Normalization (ignoring appearance) Absolute value (invariant to dark/light transitions) Direction invariant
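The feature pipeline on this slide is concrete enough to sketch. Below is a minimal NumPy version, assuming a grayscale clip stored as a (T, H, W) array; the blur width, magnitude threshold, and bin count are assumed values, not the paper's exact parameters:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spacetime_gradient_histograms(clip, threshold=10.0, bins=32):
    """Sketch of the per-scale features of Zelnik-Manor & Irani:
    normalized absolute space-time gradients, thresholded by magnitude.
    `clip` is a (T, H, W) grayscale video; `threshold` is an assumed value."""
    clip = gaussian_filter(clip.astype(np.float64), sigma=(0, 1.0, 1.0))  # blur frames first
    gt, gy, gx = np.gradient(clip)                       # space-time gradient of each pixel
    mag = np.sqrt(gx**2 + gy**2 + gt**2)
    mask = mag > threshold                               # keep only "moving" pixels
    # Normalize by magnitude (ignore appearance) and take absolute values
    # (invariance to dark/light transitions and to motion direction).
    feats = [np.abs(g[mask]) / mag[mask] for g in (gx, gy, gt)]
    # One 1-D empirical distribution per gradient component.
    return [np.histogram(f, bins=bins, range=(0.0, 1.0), density=True)[0] for f in feats]
```

Per the slides, this would be repeated at L temporal scales (e.g. on temporally smoothed and subsampled copies of the clip), yielding the 3L one-dimensional distributions mentioned next.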
“Statistical Analysis of Dynamic Actions” Comments: Actions are represented by 3L independent 1D distributions (L being the number of temporal scales) The frames are blurred first Robust to changes of appearance, e.g. highly textured clothing Action recognition/localization: for a test video sequence S and a reference sequence of T frames, each consecutive sub-sequence of length T is compared to the reference In case of multiple reference videos: Mahalanobis distance
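The localization scheme above amounts to sliding a temporal window over the test sequence. A hedged sketch for single-reference matching, with a chi-square distance standing in for the paper's exact measure:

```python
import numpy as np

def match_action(test_hists, ref_hists):
    """Slide a window of the reference's length T over the test sequence and
    compare per-window feature distributions to the reference's.
    `test_hists`, `ref_hists`: (n_frames, n_features) arrays of per-frame
    distributions; the chi-square distance here is an assumed stand-in."""
    T = len(ref_hists)
    ref = ref_hists.mean(axis=0)
    dists = []
    for start in range(len(test_hists) - T + 1):
        win = test_hists[start:start + T].mean(axis=0)
        dists.append(0.5 * np.sum((win - ref) ** 2 / (win + ref + 1e-12)))
    return np.array(dists)   # local minima = candidate detections
```

Local minima of the resulting distance curve give candidate localizations; with multiple reference videos the slides describe using a Mahalanobis distance instead.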
Space-Time Interest Points (STIP) Paper info: Title: On Space-Time Interest Points Authors: Ivan Laptev: INRIA / IRISA IJCV 2005
“On Space-Time Interest Points” Extends the Harris detector to 3D (space-time) Detects local space-time points with non-constant motion: accelerated motion corresponds to physical forces Independent space and time scales Automatic scale selection
“On Space-Time Interest Points” Automatic scale selection procedure: Detect interest points Move in the direction of the optimal scale Repeat until a locally optimal scale is reached (iterative) The procedure cannot be used in real-time: frames at future times are needed There exist estimation approaches to solve this problem
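The detector underlying this procedure is a space-time Harris response computed from the second-moment matrix of space-time gradients. A sketch of the response computation follows; the constant k = 0.005 and the integration-scale factor s are assumed values, and the iterative scale selection itself is omitted:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d_response(clip, sigma=2.0, tau=1.0, s=2.0, k=0.005):
    """Sketch of the space-time Harris response behind STIP:
    H = det(mu) - k * trace(mu)^3, where mu is the Gaussian-smoothed
    second-moment matrix of space-time gradients. Spatial (sigma) and
    temporal (tau) scales are independent; k and s are assumed values."""
    L = gaussian_filter(clip.astype(np.float64), sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)
    # Integrate outer products of gradients at scale s * (tau, sigma, sigma).
    w = (s * tau, s * sigma, s * sigma)
    m = {p: gaussian_filter(a * b, sigma=w)
         for p, (a, b) in {'xx': (Lx, Lx), 'yy': (Ly, Ly), 'tt': (Lt, Lt),
                           'xy': (Lx, Ly), 'xt': (Lx, Lt), 'yt': (Ly, Lt)}.items()}
    det = (m['xx'] * (m['yy'] * m['tt'] - m['yt'] ** 2)
           - m['xy'] * (m['xy'] * m['tt'] - m['xt'] * m['yt'])
           + m['xt'] * (m['xy'] * m['yt'] - m['yy'] * m['xt']))
    tr = m['xx'] + m['yy'] + m['tt']
    return det - k * tr ** 3   # local maxima = space-time interest points
```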
Unsupervised Action Recognition Paper info: Title: Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words Authors: Juan Carlos Niebles: University of Illinois Hongcheng Wang: University of Illinois Li Fei-Fei: University of Illinois BMVC 2006
“Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words” Generative graphical model (pLSA) The STIP detector of Piotr Dollár et al. is used Laptev’s STIP detector is too sparse A dictionary of video words is created The method is unsupervised Simultaneous action recognition/localization Evaluations on: KTH action database Skating actions database (4 action classes)
“Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words” Overview of the method (pLSA notation):
w: video word
d: video sequence
z: latent topic (action category)
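Since the model is standard pLSA with video words as "words" and sequences as "documents", a minimal EM sketch looks like this; the initialization and iteration count are assumptions:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Minimal pLSA via EM, the model used by Niebles et al. with
    'documents' = video sequences and 'words' = video words.
    counts: (n_docs, n_words) word-count matrix.
    Returns P(w|z) and P(z|d); random Dirichlet init is an assumption."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.dirichlet(np.ones(n_words), size=n_topics)   # P(w|z), (z, w)
    p_z_d = rng.dirichlet(np.ones(n_topics), size=n_docs)    # P(z|d), (d, z)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w) for every (d, w) pair.
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # (d, z, w)
        p_z_dw = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate the multinomials from expected counts.
        ec = counts[:, None, :] * p_z_dw                     # expected counts (d, z, w)
        p_w_z = ec.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = ec.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```

Classifying a new video then amounts to "folding in": re-running EM for P(z|d_new) with P(w|z) fixed and picking the topic with the highest posterior.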
Event recognition in sport images Paper info: Title: What, where and who? Classifying events by scene and object recognition Authors: Li-Jia Li: University of Illinois Li Fei-Fei: Princeton University ICCV 2007
“What, where and who? Classifying events by scene and object recognition” Goal of the paper: Event classification in still images Scene labeling Object labeling Approach: Generative graphical model Assumes that objects and scenes are independent given the event category Ignores spatial relationships between objects
“What, where and who? Classifying events by scene and object recognition” Information channels: Scene context (holistic representation) Object appearance Geometrical layout (sky at infinity/vertical structure/ground plane) Feature extraction: 12x12 patches obtained by grid sampling (10x10) For each patch: SIFT feature (used both for scene and object models) Layout label (used only for object model)
“What, where and who? Classifying events by scene and object recognition” The graphical model E: event S: scene O: object X: scene feature A: appearance feature G: geometry layout
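Given these variables and the stated assumption that objects and scenes are independent given the event category, the joint distribution factorizes roughly as below. This is a hedged reading of the slide, not the paper's full parameterization (which also involves latent scene and object topics):

```latex
% Scene and object evidence are conditionally independent given the event E.
P(E, S, O, X, A, G) = P(E)\,
  \underbrace{P(S \mid E)\, P(X \mid S)}_{\text{scene channel}}\;
  \underbrace{P(O \mid E)\, P(A, G \mid O)}_{\text{object channel}}
```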
“What, where and who? Classifying events by scene and object recognition” A new database is compiled: 8 sport event categories (downloaded from the web): Bocce, croquet, polo, rowing, snowboarding, badminton, sailing, rock climbing Average classification accuracy over all 8 event classes = 74.3%
“What, where and who? Classifying events by scene and object recognition” Sample results:
Action recognition in medium resolution regimes Paper info: Title: Recognizing Actions at a Distance Authors: Alexei A. Efros: UC Berkeley Alexander C. Berg: UC Berkeley Greg Mori: UC Berkeley Jitendra Malik: UC Berkeley ICCV 2003
“Recognizing Actions at a Distance” Overall review: Actions in medium resolution (30 pix tall) Proposes a new motion descriptor KNN for classification A consistent tracking bounding box of the actor is required Action recognition is done only on the tracking bounding box Motion is described as relative movement of body parts No info about movements is given by the tracker
“Recognizing Actions at a Distance” Motion Feature: For each frame, a local temporal neighborhood is considered Optical flow is extracted (other alternatives: image pixel values, temporal gradients) OF is noisy: half-wave rectification + blurring To preserve motion info, the OF vector is decomposed into its vertical/horizontal components
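This channel construction is easy to make concrete. A sketch for a single flow field; the blur width is an assumed value:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def motion_channels(flow, blur_sigma=2.0):
    """Sketch of the Efros et al. motion descriptor for one frame:
    split optical flow into horizontal/vertical components, half-wave
    rectify each into positive/negative parts, then blur.
    `flow` is an (H, W, 2) array of (Fx, Fy); blur_sigma is assumed."""
    fx, fy = flow[..., 0], flow[..., 1]
    channels = [np.maximum(fx, 0), np.maximum(-fx, 0),   # Fx+, Fx-
                np.maximum(fy, 0), np.maximum(-fy, 0)]   # Fy+, Fy-
    return [gaussian_filter(c, blur_sigma) for c in channels]  # blur to suppress noise
```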
“Recognizing Actions at a Distance” Similarity measure between the 1st video sequence A and the 2nd video sequence B (frame-to-frame correlation of the blurred motion channels):
S(i, j) = Σ_{t ∈ T} Σ_{(x,y) ∈ I} Σ_c a_c^{i+t}(x, y) · b_c^{j+t}(x, y)
where i, j index frames, T is the temporal extent, I is the spatial extent, c ranges over the motion channels, and a, b denote the channel responses of A and B
“Recognizing Actions at a Distance” New Dataset: Ballet (stationary camera): 16 action classes 2 men + 2 women Easy dataset (controlled environment) Tennis (real action, stationary camera): 6 action classes (stand, swing, move-left, …) different days/location/camera position 2 players (man + woman) Football (real action, moving camera): 8 action classes (run-left 45˚, run-left, walk-left, …) Zoom in/out
“Recognizing Actions at a Distance” Average classification accuracy: Ballet: 87.44% (5NN) Tennis: 64.33% (5NN) Football: 65.38% (1NN) What can be done?
“Recognizing Actions at a Distance” Applications: Do as I Do: Replace actors in videos Do as I Say: Develop real-world motions in computer games 2D/3D skeleton transfer Figure Correction: Remove occlusion/clutter in movies
KTH Action Dataset Paper info: Title: Recognizing Human Actions: A Local SVM Approach Authors: Christian Schuldt: KTH university Ivan Laptev: KTH university ICPR 2004
“Recognizing Human Actions: A Local SVM Approach” New dataset (KTH action database): 2391 video sequences 6 action classes (Walking, Jogging, Running, Handclapping, Boxing, Hand-waving) 25 persons Static camera 4 scenarios: Outdoors (s1) Outdoors + scale variation (s2): the hardest scenario Outdoors + cloth variation (s3) Indoors (s4)
“Recognizing Human Actions: A Local SVM Approach” Features: Sparse (STIP detector) Spatio-temporal jets of order 4 Different feature representations: Raw jet feature descriptors Exponential kernel on the histogram of jets Spatial HoG with temporal pyramid Different classifiers: SVM NN
“Recognizing Human Actions: A Local SVM Approach” Experimental results: Local Feature (jets) + SVM performs the best SVM outperforms NN HistLF (histogram of jets) is slightly better than HistSTG (histogram of spatio-temporal gradients) Average classification accuracy on all scenarios = 71.72%
Action Recognition in Real Scenarios Paper info: Title: Retrieving Actions in Movies Authors: Ivan Laptev: INRIA / IRISA Patrick Pérez: INRIA / IRISA ICCV 2007
“Retrieving Actions in Movies” A new action database from real movies Experiments only on Drinking action vs. random/Smoking Main contributions: Recognizing unrestricted real actions Key-frame priming Configuration of experiments: Action recognition (on pre-segmented seq.) Comparing different features Action detection (using key-frame priming)
“Retrieving Actions in Movies” Real movie action database: 105 drinking actions 141 smoking actions Different scenes/people/views www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html Action representation: R = (P, ΔP) P = (X, Y, T): space-time coordinates ΔP = (ΔX, ΔY, ΔT): ΔX: 1.6 × the width of the head bounding box ΔY: 1.3 × the height of the head bounding box
“Retrieving Actions in Movies” Learning scheme: Discrete AdaBoost + FLD (Fisher Linear Discriminant) All action cuboids are normalized to 14x14x8 cells of 5x5x5 pixels (needed for boosting) Slightly temporally randomized sequences are added to the training set HoG (4 bins) / OF (5 bins) is used Local features: Θ = (x, y, t, δx, δy, δt, β, Ψ) β ∈ {plain, temp-2, spat-4} Ψ ∈ {OF5, Grad4}
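A compact sketch of discrete AdaBoost with FLD-based weak learners selecting among candidate feature blocks; the unweighted within-class scatter and the midpoint threshold are simplifying assumptions, not the paper's exact recipe:

```python
import numpy as np

def fld_direction(X, y, reg=1e-6):
    """Fisher linear discriminant direction for binary labels y in {0, 1}."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Sw = np.cov(X[y == 0].T) + np.cov(X[y == 1].T) + reg * np.eye(X.shape[1])
    return np.linalg.solve(Sw, mu1 - mu0)

def adaboost_fld(blocks, y, n_rounds=50):
    """Discrete AdaBoost whose weak learners are thresholded FLD projections
    of single candidate feature blocks. blocks: list of (n_samples, d_b)
    arrays, one per local feature; y: labels in {0, 1}."""
    y_pm = 2 * y - 1                                   # labels in {-1, +1}
    w = np.full(len(y), 1.0 / len(y))                  # sample weights
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for b, X in enumerate(blocks):                 # pick the best weak learner
            d = fld_direction(X, y)
            proj = X @ d
            thr = 0.5 * (proj[y == 0].mean() + proj[y == 1].mean())
            for s in (1, -1):
                pred = np.where(s * (proj - thr) > 0, 1, -1)
                err = w[pred != y_pm].sum()
                if best is None or err < best[0]:
                    best = (err, b, d, thr, s, pred)
        err, b, d, thr, s, pred = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y_pm * pred)              # upweight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, b, d, thr, s))         # strong classifier:
    return ensemble                                    # sign(sum_r alpha*s*sign(x_b@d - thr))
```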
“Retrieving Actions in Movies” HoG captures shape, OF captures motion Informative motions: start & end of the action Key-frame: when the hand reaches the head Boosted histogram on HoG: no motion info around the key-frame Integration of motion & key-frame should help
“Retrieving Actions in Movies” Experiments: OF / OF+HoG / STIP+NN / only key-frame OF / OF+HoG works best on the hard test (drinking vs. smoking) Extension of OF5 to OFGrad9 does not help! Key-frame priming: #FPs decreases significantly (different info channels) Significant gain in overall accuracy: it is better to model motion and appearance separately Speed of the key-primed version: 3 seconds per frame
“Retrieving Actions in Movies” Possible extensions: Extend the experiments to more action classes Make it real-time
Automatic Video Annotation Paper info: Title: Learning Realistic Human Actions from Movies Authors: Ivan Laptev: INRIA / IRISA Marcin Marszalek: INRIA / LEAR Cordelia Schmid: INRIA / LEAR Benjamin Rozenfeld: Bar-Ilan University CVPR 2008
“Learning Realistic Human Actions from Movies” Overview: Automatic movie annotation: Alignment of movie scripts Text classification Classification of real actions Provides a new dataset Beats state-of-the-art results on the KTH dataset Extends the spatial pyramid to a space-time pyramid
“Learning Realistic Human Actions from Movies” Movie script: publicly available textual descriptions of: Scene description Characters Transcribed dialogs Actions (descriptive) Limitations: No exact timing alignment No guarantee of correspondence with real actions Actions are expressed literally (diverse descriptions) Actions may be missed due to lack of conversation
“Learning Realistic Human Actions from Movies” Automatic annotation: Subtitles include exact time alignment Timing of scripts is matched via the subtitles Action labels are assigned from the script text by a text classifier New dataset: 8 action classes (AnswerPhone, GetOutCar, SitUp, …) Two training sets (automatically/manually annotated) 60% of the automatic training set is correctly annotated http://www.irisa.fr/vista/actions
“Learning Realistic Human Actions from Movies” Action classification approach: BoF framework (k=4000) Space-time pyramids 6 spatial grids: {1x1, 2x2, 3x3, 1x3, 3x1, o2x2} 4 temporal grids: {t1, t2, t3, ot2} STIP with multiple scales HoG and HoF
“Learning Realistic Human Actions from Movies” Feature extraction: A volume of (2kσ x 2kσ x 2kτ) is taken around each STIP, where σ/τ is the spatial/temporal extent (k=9) The volume is divided into a spatio-temporal grid of cells HoG and HoF are calculated for each grid cell and concatenated together These concatenated features are concatenated once more according to the pattern of the spatio-temporal pyramid
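A sketch of this volume-and-grid descriptor layout in NumPy; the 3x3x2 cell grid and the placeholder per-cell histogram are assumptions, not necessarily the paper's exact parameters:

```python
import numpy as np

def cell_hist(cell, bins=4):
    """Placeholder per-cell histogram (standing in for HoG or HoF of a cell)."""
    h, _ = np.histogram(cell, bins=bins)
    return h / max(h.sum(), 1)

def stip_volume_descriptor(clip_feats, x, y, t, sigma, tau, k=9, grid=(3, 3, 2)):
    """Take a (2k*sigma, 2k*sigma, 2k*tau) volume around a STIP at (x, y, t),
    split it into a grid of cells, and concatenate one histogram per cell.
    `clip_feats` is any per-pixel feature map stored as a (T, H, W) array;
    the (nx, ny, nt) = (3, 3, 2) grid is an assumed default."""
    dx, dt = int(k * sigma), int(k * tau)
    vol = clip_feats[t - dt:t + dt, y - dx:y + dx, x - dx:x + dx]
    nx, ny, nt = grid
    cells = []
    for ct in np.array_split(vol, nt, axis=0):        # temporal slices
        for cy in np.array_split(ct, ny, axis=1):     # vertical slices
            for cx in np.array_split(cy, nx, axis=2): # horizontal slices
                cells.append(cell_hist(cx))
    return np.concatenate(cells)
```

Per the slide, the resulting per-point descriptors are quantized into visual words, and the word histograms of the pyramid cells are concatenated once more into the final channel representation.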
“Learning Realistic Human Actions from Movies” Different channels: Each spatio-temporal template: one channel Greedy search to find the best channel combination Kernel function: χ² (chi-square) Observations: HoG performs better than HoF No temporal subdivision is preferred (temporal grid = t1) Combination of channels improves classification in the real scenario Mean AP on the KTH action database = 91.8% Mean AP on the real movies database: Trained on the manually annotated dataset: 39.5% Trained on the automatically annotated dataset: 22.9% Random classifier (chance): 12.5%
“Learning Realistic Human Actions from Movies” Future work: Increase robustness to annotation noise Improve script-to-video alignment Learn on a larger database of automatic annotations Experiment with more low-level features Move from BoF to detector-based methods (HMM-based methods should work) The table shows the effect of temporal division when combining channels: the pattern of the spatio-temporal pyramid changes so that context is best captured when the action is scene-dependent
Image Context in Action Recognition Paper info: Title: Actions in Context Authors: Marcin Marszalek: INRIA / LEAR Ivan Laptev: INRIA / IRISA Cordelia Schmid: INRIA / LEAR CVPR 2009
“Actions in Context” Contributions: Automatic learning of scene classes from video Improve action recognition using image context, and vice versa Movie scripts are used for automatic training For both action and scene: BoF + SVM New large database: 12 action classes 69 movies involved 10 scene classes www.irisa.fr/vista/actions/hollywood2
“Actions in Context” For automatic annotation, scenes are identified only from text Features: SIFT (scene modeling) on 2D-Harris points HoG and HoF (motion) on 3D-Harris points (STIP)
“Actions in Context” Features: SIFT: extracted at 2D-Harris detections Captures static appearance Used for modeling scene context Calculated on single frames (one every 2 seconds) HoG/HoF: extracted at 3D-Harris detections HoG captures dynamic appearance HoF captures motion patterns One video dictionary per channel is created A histogram of video words is created for each channel Classifier: SVM with an exponential (RBF) kernel on the χ² distance, summed over multiple channels
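A hedged sketch of such a classifier: a χ² distance summed over channels inside an exponential kernel, fed to a precomputed-kernel SVM (setting the scale parameter from the mean distance is an assumption):

```python
import numpy as np
from sklearn.svm import SVC

def chi2_rbf_kernel(H1, H2, gamma=None):
    """Exponential chi-square kernel summed over channels:
    K(x, y) = exp(-gamma * sum_c D_chi2(x_c, y_c)).
    H1, H2: lists of per-channel histogram matrices, (n1, d_c) and (n2, d_c);
    gamma = 1 / mean distance is an assumed normalization."""
    D = sum(0.5 * ((h1[:, None, :] - h2[None, :, :]) ** 2
                   / (h1[:, None, :] + h2[None, :, :] + 1e-12)).sum(-1)
            for h1, h2 in zip(H1, H2))
    gamma = gamma if gamma is not None else 1.0 / D.mean()
    return np.exp(-gamma * D)

# Usage with a precomputed-kernel SVM:
# K_train = chi2_rbf_kernel(train_channels, train_channels)
# clf = SVC(kernel='precomputed').fit(K_train, labels)
# preds = clf.predict(chi2_rbf_kernel(test_channels, train_channels))
```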
“Actions in Context” Evaluations: SIFT: better for context HoG/HoF: better for action Context alone can also classify actions fairly well! Combination of the 3 channels works best
“Actions in Context” Observations: Context is not always helpful Idea: the model should control the contribution of context for each action class individually Overall, the accuracy gain from context is not significant: Idea: other types of context should work better
Object Co-occurrence in Action Recognition Paper info: Title: Selection and Context for Action Recognition Authors: Dong Han: University of Bonn Liefeng Bo: TTI-Chicago Cristian Sminchisescu: University of Bonn ICCV 2009
“Selection and Context for Action Recognition” Main contributions: Contextual scene descriptors based on: Presence/absence of objects (bag-of-detectors) Structural relations between objects and their parts Automatic learning of multiple features Multiple Kernel Gaussian Process Classifier (MKGPC) Experimental results on: KTH action dataset Hollywood 1 & 2 Human Action databases (INRIA)
“Selection and Context for Action Recognition” Main message: Detection of a Car with a Person in its proximity increases the probability of a Get-Out-Car action Provides a framework to train a classifier on a combination of multiple features (not necessarily all relevant), e.g. HOG + HOF + histogram intersection, … Similar to MKL, but here the parameters (weights + hyper-parameters) are learnt automatically A Gaussian Process scheme is used for learning
“Selection and Context for Action Recognition” Descriptors: Bag of Detectors Deformable part models are used (Pedro Felzenszwalb et al.) Once object bounding boxes are detected, 3 descriptors are built: ObjPres (4D) ObjCount (4D) ObjDist (21D): pairwise distances between the 7 parts of the Person detector HOG (4D) + HOF (5D) from the STIP detector (Ivan Laptev) Spatial grids: 1x1, 2x1, 3x1, 4x1, 2x2, 3x3 Temporal grids: t1, t2, t3 3D gradient features
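A sketch of how these bag-of-detectors descriptors could be assembled from per-video detections; the object class list and the detection record format are assumptions:

```python
import numpy as np
from itertools import combinations

def object_context_descriptors(detections, classes=('person', 'car', 'horse', 'phone')):
    """Build ObjPres/ObjCount from per-class detection counts, and ObjDist
    from pairwise distances between the parts of the best person detection.
    `detections`: list of dicts with 'class', 'score', and optionally
    'parts' (7 part centers for the person model); the class list and
    record format are assumptions for illustration."""
    counts = np.array([sum(d['class'] == c for d in detections) for c in classes])
    obj_pres = (counts > 0).astype(float)            # 4-D presence/absence
    obj_count = counts.astype(float)                 # 4-D counts
    persons = [d for d in detections if d['class'] == 'person' and 'parts' in d]
    if persons:
        parts = max(persons, key=lambda d: d['score'])['parts']  # 7 (x, y) centers
        obj_dist = np.array([np.linalg.norm(np.subtract(p, q))
                             for p, q in combinations(parts, 2)])  # C(7,2) = 21-D
    else:
        obj_dist = np.zeros(21)
    return obj_pres, obj_count, obj_dist
```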
“Selection and Context for Action Recognition” Experimental results: KTH dataset: 94.1% mean AP vs. 91.8% reported by Laptev Superior to the state-of-the-art in all but the Running class HOHA1 dataset: Trained on the clean set only The optimal subset of features is found greedily (addition/removal) based on test error 47.5% mean AP vs. 38.4% reported by Laptev HOHA2 dataset: 43.12% mean AP vs. 35.1% reported by Marszalek

“Selection and Context for Action Recognition” Best feature combination