Paper presented at the 6th International Work-Conference on Ambient Assisted Living.
Abstract: Due to the increasing demand for multi-camera setups and long-term monitoring in vision applications, real-time multi-view action recognition has gained great interest in recent years. In this paper, we propose a multiple kernel learning based fusion framework that employs a motion-based person detector for finding regions of interest, and local descriptors with bag-of-words quantisation for feature representation. The experimental results on a multi-view action dataset suggest that the proposed framework significantly outperforms simple fusion techniques and state-of-the-art methods.
A Multiple Kernel Learning Based Fusion Framework for Real-Time Multi-View Action Recognition
A MKL Based Fusion Framework for Real-Time Multi-View Action Recognition
Feng Gu, Francisco Florez-Revuelta, Dorothy Monekosso and Paolo Remagnino
Digital Imaging Research Centre
Kingston University, London, UK
December 3rd, 2014
Outline
1 Introduction
2 Framework Overview
3 Experimental Conditions
4 Results and Analysis
5 Conclusions and Future Work
Background and Motivations
Real-time multi-view action recognition:
Has gained increasing interest in video surveillance, human-computer interaction, multimedia retrieval, etc.
Provides complementary fields of view (FOVs) of a monitored scene via multiple cameras
Leads to more robust decision making based on multiple heterogeneous video streams
Real-time capability enables continuous long-term monitoring
Where possible, multiple cameras should be deployed to monitor human behaviour, so that data fusion techniques can be applied.
Illustration of the Monitored Scenario
Figure: A scene monitored by four cameras, C1, C2, C3 and C4.
Motion-Based Person Detector
We use a state-of-the-art motion-based tracker [6]:
Each pixel is modelled as a mixture of Gaussians in RGB space
The background model is used to find foreground pixels in a new frame
Found foreground pixels are grouped into large regions associated with the person of interest
Kalman filters are used to track the foreground detections
Person detections are generated for every frame
A minimal code sketch of this pipeline is given below.
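The following is a hedged sketch of such a detection pipeline, approximating the mixture-of-Gaussians tracker of [6] with OpenCV's MOG2 background subtractor; the parameter values and the single-target Kalman smoothing are illustrative assumptions, not the authors' exact implementation.

```python
# Approximate sketch of the motion-based person detector: mixture-of-Gaussians
# background subtraction plus Kalman smoothing of the largest foreground region.
# Parameters and the single-target assumption are illustrative, not from the paper.
import cv2
import numpy as np

bg_model = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)
kalman = cv2.KalmanFilter(4, 2)  # state (x, y, dx, dy), measurement (x, y)
kalman.transitionMatrix = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                                    [0, 0, 1, 0], [0, 0, 0, 1]], np.float32)
kalman.measurementMatrix = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], np.float32)
kalman.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3

def detect_person(frame):
    """Return a smoothed bounding box (x, y, w, h) for the largest moving region."""
    fg_mask = bg_model.apply(frame)  # foreground pixels in the new frame
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    kalman.correct(np.array([[x + w / 2.0], [y + h / 2.0]], np.float32))
    cx, cy = kalman.predict()[:2].flatten()  # Kalman-smoothed centre
    return int(cx - w / 2), int(cy - h / 2), w, h
```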
Feature Representation of Videos
STIP and improved dense trajectories (IDT) [7] are used as local descriptors to extract visual features from a video
Person detections and frame spans define an XYT cuboid associated with an action performed by the monitored person
Bag of words (BoW) is applied to compute the feature vector of a cuboid, where K-Means clustering is used to generate the codebook (see the sketch below)
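Below is a minimal sketch of the BoW quantisation step, assuming scikit-learn; the codebook size of 4000 follows the implementation details given later, while the use of MiniBatchKMeans is an assumption for speed.

```python
# Sketch of bag-of-words quantisation for local descriptors (e.g. STIP or IDT).
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(train_descriptors, k=4000):
    """Cluster sampled training descriptors (n, D) into a k-word codebook."""
    return MiniBatchKMeans(n_clusters=k, random_state=0).fit(train_descriptors)

def bow_histogram(codebook, cuboid_descriptors):
    """Quantise one cuboid's descriptors and return an l1-normalised histogram."""
    words = codebook.predict(cuboid_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)  # l1 normalisation
```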
Discriminative Models for Classification
Let $x_i^k \in \mathbb{R}^D$, where $i \in \{1, 2, \ldots, N\}$ is the index of a feature vector corresponding to an XYT cuboid and $k \in \{1, 2, \ldots, K\}$ is the index of a camera view. We learn an SVM classifier as
$$f(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b \quad (1)$$
We then compute a classification score via a sigmoid function as
$$p(y = 1 \mid x) = \frac{1}{1 + \exp(-f(x))} \quad (2)$$
A minimal sketch of this per-view classifier is given below.
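The following sketch covers Eqs. (1)-(2) for one camera view, assuming scikit-learn with a precomputed chi-square kernel (the kernel named in the implementation details); applying the sigmoid directly to the decision value is a simplification of score calibration.

```python
# Per-view SVM with a precomputed chi-square kernel and a sigmoid-mapped score.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_view_classifier(X_train, y_train):
    """Train a binary SVM for one camera view on BoW histograms (Eq. 1)."""
    return SVC(kernel="precomputed").fit(chi2_kernel(X_train, X_train), y_train)

def classification_score(clf, X_train, x):
    """Sigmoid-map the decision value: p(y=1|x) = 1 / (1 + exp(-f(x))) (Eq. 2)."""
    f = clf.decision_function(chi2_kernel(x[None, :], X_train))[0]
    return 1.0 / (1.0 + np.exp(-f))
```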
Simple Fusion Strategies
Concatenation of Features (COM): concatenate the feature vectors of the multiple views into a single feature vector, $\tilde{x}_i = [x_i^1; \ldots; x_i^K]$
Sum of Classification Scores (SUM): sum the scores of all camera views, $\sum_{k=1}^{K} p(y = 1 \mid x^k)$
Product of Classification Scores (PRD): multiply the scores of all camera views, $\prod_{k=1}^{K} p(y = 1 \mid x^k)$
A sketch of these three baselines follows below.
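A minimal sketch of the three baselines, assuming per-view feature vectors xs[k] and per-view scoring functions score_fns[k] (both names are illustrative):

```python
# The three simple fusion baselines compared against MKL: COM, SUM, PRD.
import numpy as np

def fuse_concat(xs):
    """COM: concatenate per-view feature vectors into one vector for a single SVM."""
    return np.concatenate(xs)

def fuse_sum(score_fns, xs):
    """SUM: add the per-view classification scores p(y=1|x^k)."""
    return sum(f(x) for f, x in zip(score_fns, xs))

def fuse_product(score_fns, xs):
    """PRD: multiply the per-view classification scores."""
    return float(np.prod([f(x) for f, x in zip(score_fns, xs)]))
```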
Multiple Kernel Learning
Combine multiple kernels corresponding to different data sources (e.g. camera views) via a convex combination such as
$$K(x_i, x_j) = \sum_{k=1}^{K} \eta_k \, k_k(x_i^k, x_j^k) \quad (3)$$
where $\eta_k \geq 0$, $\sum_{k=1}^{K} \eta_k = 1$, and each kernel $k_k$ only uses a distinct set of features from a data source. A sketch of this combination follows below.
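A minimal sketch of Eq. (3) with one chi-square kernel per camera view; the weight symbol eta and the scikit-learn kernel call are assumptions consistent with the implementation details:

```python
# Convex combination of per-view kernels (Eq. 3).
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel

def combined_kernel(views_a, views_b, eta):
    """K = sum_k eta_k * chi2_kernel(X_a^k, X_b^k) over K camera views."""
    assert np.all(eta >= 0) and np.isclose(eta.sum(), 1.0)  # convexity constraint
    return sum(w * chi2_kernel(Xa, Xb) for w, Xa, Xb in zip(eta, views_a, views_b))
```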
Two-Step Optimisation
We need to learn the SVM parameters, i.e. the weights ($\alpha_i^k$) and bias ($b_k$) of each SVM model, and the combination parameters $\eta_k$ in (3). This can be solved as follows:
Step 1: optimise over the SVM parameters $\alpha_i^k$ and $b_k$ while fixing the combination parameters $\eta_k$ (standard SVM solver)
Step 2: optimise over the combination parameters $\eta_k$ while fixing the SVM parameters $\alpha_i^k$ and $b_k$ (gradient descent)
The procedure alternates between the two steps iteratively until the system converges to an optimal solution (see the sketch below).
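A hedged sketch of the alternating optimisation, in the style of SimpleMKL; the gradient form, step size, and simplex projection are simplified assumptions rather than the paper's exact update rule.

```python
# Two-step MKL: step 1 fits an SVM on the combined kernel; step 2 updates the
# kernel weights eta by projected gradient descent on the dual objective.
import numpy as np
from sklearn.svm import SVC

def two_step_mkl(kernels, y, n_iters=20, lr=0.1):
    """kernels: list of K precomputed (n, n) gram matrices, one per camera view."""
    eta = np.full(len(kernels), 1.0 / len(kernels))
    for _ in range(n_iters):
        K = sum(w * Km for w, Km in zip(eta, kernels))
        clf = SVC(kernel="precomputed").fit(K, y)   # step 1: fix eta, fit SVM
        alpha_y = clf.dual_coef_[0]                 # signed dual coefficients y_i * alpha_i
        sv = clf.support_
        # step 2: gradient of the dual objective w.r.t. each eta_k (up to scale)
        grad = np.array([-0.5 * alpha_y @ Km[np.ix_(sv, sv)] @ alpha_y
                         for Km in kernels])
        eta = np.maximum(eta - lr * grad, 1e-8)
        eta /= eta.sum()                            # project back onto the simplex
    return eta, clf
```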
IXMAS Multi-View Dataset
Created for view-invariant human action recognition [8]
Includes 13 daily actions, each performed 3 times by 12 actors
Video sequences collected via 5 cameras, at 23 frames per second and 390 × 291 resolution
We use all 12 actors and 5 cameras, and evaluate 11 actions as in [9]
Leave-one-subject-out cross-validation is used in the experiments (sketched below)
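A minimal sketch of the leave-one-subject-out protocol, assuming scikit-learn and a hypothetical actor_ids array giving each sample's subject; fit_predict is likewise a hypothetical callable wrapping training and prediction:

```python
# Leave-one-subject-out cross-validation: one fold per held-out actor.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loso_scores(fit_predict, X, y, actor_ids):
    """Return per-fold accuracies; no actor appears in both train and test."""
    accs = []
    for tr, te in LeaveOneGroupOut().split(X, y, groups=actor_ids):
        y_pred = fit_predict(X[tr], y[tr], X[te])  # train on 11 actors, test on 1
        accs.append(float(np.mean(y_pred == y[te])))
    return accs
```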
Implementation Details
A codebook of size 4000 is quantised from 100,000 randomly selected descriptor features of the training set
The STIP descriptor uses the entire image plane and the frame span of an action given in the ground truth to define a cuboid
The IDT descriptor relies on the person detections in addition to the frame span
All the SVM models use $\ell_1$ normalisation and the $\chi^2$ kernel (sketched below)
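For reference, a self-contained sketch of the chi-square kernel on l1-normalised histograms; the explicit gamma parameter and epsilon guard are assumptions (this mirrors scikit-learn's chi2_kernel behaviour):

```python
# Chi-square kernel: k(a, b) = exp(-gamma * sum_d (a_d - b_d)^2 / (a_d + b_d)).
import numpy as np

def chi2_kernel_manual(A, B, gamma=1.0):
    """Gram matrix between histogram rows of A (n, D) and B (m, D)."""
    eps = 1e-10                          # guard against empty bins
    d = A[:, None, :] - B[None, :, :]
    s = A[:, None, :] + B[None, :, :] + eps
    return np.exp(-gamma * (d * d / s).sum(axis=-1))
```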
Person Detection Results
Figure: Detection results of the motion-based tracker for the first run of the subject `Alba', across all five camera views (cam0, cam1, cam2, cam3, cam4).
Results of STIP (Internal Comparison)
Figure: Class-wise mean recognition rates over all folds of the compared methods using the STIP descriptor, for the classes check watch, sit down, cross arms, scratch head, get up, turn around, walk, wave, punch, kick and pick up; overall SVM-COM = 0.819, SVM-SUM = 0.820, SVM-PRD = 0.815, and SVM-MKL = 0.842.
Results of IDT (Internal Comparison)
Figure: Class-wise mean recognition rates over all folds of the compared methods using the IDT descriptor, for the same eleven classes; overall SVM-COM = 0.915, SVM-SUM = 0.927, SVM-PRD = 0.921, and SVM-MKL = 0.950.
Comparison with State-of-the-Art (External Comparison)
Method                  Actions  Actors  Views  Rate   FPS
Cilla et al. [3]        11       12      5      0.913  N/A
Weinland et al. [10]    11       10      5      0.933  N/A
Cilla et al. [4]        11       10      5      0.940  N/A
Holte et al. [5]        13       12      5      1.000  N/A
Weinland et al. [9]     11       10      5      0.835  500
Chaaraoui et al. [1]    11       12      5      0.859  26
Chaaraoui et al. [2]    11       12      5      0.914  207
SVM-MKL (IDT+BoWs)      11       12      5      0.950  25

Table: Comparison of the proposed MKL method using the IDT descriptor and BoWs; methods with `N/A' in the FPS column are offline.
Conclusions and Future Work
The proposed MKL based framework outperforms the simple fusion techniques and the state-of-the-art methods
The IDT descriptor is superior to the STIP descriptor for feature representation in action recognition
The proposed framework is capable of performing real-time action recognition at 25 frames per second
Future work: apply the framework to other similar vision problems, and study alternative feature representations and fusion techniques.
Thank you very much! Any questions?
References

[1] A. A. Chaaraoui, P. Climent-Perez, and F. Florez-Revuelta. Silhouette-based human action recognition using sequences of key poses. Pattern Recognition Letters, 34:1799-1807, 2013.
[2] A. A. Chaaraoui, J. R. Padilla-Lopez, F. J. Ferrandez-Pastor, M. Nieto-Hidalgo, and F. Florez-Revuelta. A vision-based system for intelligent monitoring: human behaviour analysis and privacy by context. Sensors, 14:8895-8925, 2014.
[3] R. Cilla, M. A. Patricio, and A. Berlanga. A probabilistic, discriminative and distributed system for the recognition of human actions from multiple views. Neurocomputing, 75:78-87, 2012.
[4] R. Cilla, M. A. Patricio, A. Berlanga, and J. M. Molina. Human action recognition with sparse classification and multiple-view learning. Expert Systems, DOI: 10.1111/exsy.12040, 2013.
[5] M. Holte, B. Chakraborty, J. Gonzalez, and T. Moeslund. A local 3-D motion descriptor for multi-view human action recognition from 4-D spatio-temporal interest points. IEEE Journal of Selected Topics in Signal Processing, 6:553-565, 2012.
[6] C. Stauffer and W. Grimson. Learning patterns of activity using real time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(8):747-767, 2000.
[7] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In British Machine Vision Conference (BMVC), 2009.
[8] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars. In IEEE International Conference on Computer Vision (ICCV), pages 1-7, 2007.
[9] D. Weinland, M. Ozuysal, and P. Fua. Making action recognition robust to occlusions and viewpoint changes. In European Conference on Computer Vision (ECCV), 2010.
[10] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 104(2-3):249-257, 2006.