Paper presented at the 6th International Work-Conference on Ambient Assisted Living.
Abstract: Due to the increasing demand for multi-camera setups and long-term monitoring in vision applications, real-time multi-view action recognition has gained great interest in recent years. In this paper, we propose a multiple kernel learning based fusion framework that employs a motion-based person detector for finding regions of interest, and local descriptors with bag-of-words quantisation for feature representation. The experimental results on a multi-view action dataset suggest that the proposed framework significantly outperforms simple fusion techniques and state-of-the-art methods.
A Multiple Kernel Learning Based Fusion Framework for Real-Time Multi-View Action Recognition
A MKL Based Fusion Framework for Real-Time Multi-View Action Recognition
Feng Gu, Francisco Florez-Revuelta, Dorothy Monekosso and Paolo Remagnino
Digital Imaging Research Centre
Kingston University, London, UK
December 3rd, 2014
Outline
1 Introduction
2 Framework Overview
3 Experimental Conditions
4 Results and Analysis
5 Conclusions and Future Work
Background and Motivations
Real-time multi-view action recognition:
Has gained increasing interest in video surveillance, human-computer interaction, multimedia retrieval, etc.
Provides complementary fields of view (FOVs) of a monitored scene via multiple cameras
Leads to more robust decision making based on multiple heterogeneous video streams
Real-time capability enables continuous long-term monitoring
Where possible, multiple cameras should be deployed to monitor human behaviour, so that data fusion techniques can be applied.
Illustration of the Monitored Scenario
Figure: A scene monitored by four cameras, C1, C2, C3 and C4.
Motion-Based Person Detector
We use a state-of-the-art motion-based tracker [6]:
Each pixel is modelled as a mixture of Gaussians in RGB space
The background model is used to find foreground pixels in a new frame
Found foreground pixels are grouped into large regions associated with the person of interest
Kalman filters are used to track the foreground detections
Person detections are generated for every frame
A minimal code sketch of this pipeline is given below.
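The following is a hedged sketch of such a detection pipeline, approximating the mixture-of-Gaussians tracker of [6] with OpenCV's MOG2 background subtractor; the parameter values and the single-target Kalman smoothing are illustrative assumptions, not the authors' exact implementation.

```python
# Approximate sketch of the motion-based person detector: mixture-of-Gaussians
# background subtraction plus Kalman smoothing of the largest foreground region.
# Parameters and the single-target assumption are illustrative, not from the paper.
import cv2
import numpy as np

bg_model = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)
kalman = cv2.KalmanFilter(4, 2)  # state (x, y, dx, dy), measurement (x, y)
kalman.transitionMatrix = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                                    [0, 0, 1, 0], [0, 0, 0, 1]], np.float32)
kalman.measurementMatrix = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], np.float32)
kalman.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3

def detect_person(frame):
    """Return a smoothed bounding box (x, y, w, h) for the largest moving region."""
    fg_mask = bg_model.apply(frame)  # foreground pixels in the new frame
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    kalman.correct(np.array([[x + w / 2.0], [y + h / 2.0]], np.float32))
    cx, cy = kalman.predict()[:2].flatten()  # Kalman-smoothed centre
    return int(cx - w / 2), int(cy - h / 2), w, h
```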
Feature Representation of Videos
STIP and improved dense trajectories (IDT) [7] are used as local descriptors to extract visual features from a video
Person detections and frame spans define an XYT cuboid associated with an action performed by the monitored person
Bag of words (BoW) is applied to compute the feature vector of a cuboid, where K-Means clustering is used to generate the codebook (see the sketch below)
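Below is a minimal sketch of the BoW quantisation step, assuming scikit-learn; the codebook size of 4000 follows the implementation details given later, while the use of MiniBatchKMeans is an assumption for speed.

```python
# Sketch of bag-of-words quantisation for local descriptors (e.g. STIP or IDT).
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(train_descriptors, k=4000):
    """Cluster sampled training descriptors (n, D) into a k-word codebook."""
    return MiniBatchKMeans(n_clusters=k, random_state=0).fit(train_descriptors)

def bow_histogram(codebook, cuboid_descriptors):
    """Quantise one cuboid's descriptors and return an l1-normalised histogram."""
    words = codebook.predict(cuboid_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)  # l1 normalisation
```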
Discriminative Models for Classification
Let $x_i^k \in \mathbb{R}^D$, where $i \in \{1, 2, \ldots, N\}$ is the index of a feature vector corresponding to an XYT cuboid and $k \in \{1, 2, \ldots, K\}$ is the index of a camera view. We learn an SVM classifier as
$$f(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b \quad (1)$$
We then compute a classification score via a sigmoid function as
$$p(y = 1 \mid x) = \frac{1}{1 + \exp(-f(x))} \quad (2)$$
A minimal sketch of this per-view classifier is given below.
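The following sketch covers Eqs. (1)-(2) for one camera view, assuming scikit-learn with a precomputed chi-square kernel (the kernel named in the implementation details); applying the sigmoid directly to the decision value is a simplification of score calibration.

```python
# Per-view SVM with a precomputed chi-square kernel and a sigmoid-mapped score.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_view_classifier(X_train, y_train):
    """Train a binary SVM for one camera view on BoW histograms (Eq. 1)."""
    return SVC(kernel="precomputed").fit(chi2_kernel(X_train, X_train), y_train)

def classification_score(clf, X_train, x):
    """Sigmoid-map the decision value: p(y=1|x) = 1 / (1 + exp(-f(x))) (Eq. 2)."""
    f = clf.decision_function(chi2_kernel(x[None, :], X_train))[0]
    return 1.0 / (1.0 + np.exp(-f))
```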
Simple Fusion Strategies
Concatenation of Features (COM): concatenate the feature vectors of the multiple views into a single feature vector, $\tilde{x}_i = [x_i^1; \ldots; x_i^K]$
Sum of Classification Scores (SUM): sum the scores of all camera views, $\sum_{k=1}^{K} p(y = 1 \mid x^k)$
Product of Classification Scores (PRD): multiply the scores of all camera views, $\prod_{k=1}^{K} p(y = 1 \mid x^k)$
A sketch of these three baselines follows below.
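A minimal sketch of the three baselines, assuming per-view feature vectors xs[k] and per-view scoring functions score_fns[k] (both names are illustrative):

```python
# The three simple fusion baselines compared against MKL: COM, SUM, PRD.
import numpy as np

def fuse_concat(xs):
    """COM: concatenate per-view feature vectors into one vector for a single SVM."""
    return np.concatenate(xs)

def fuse_sum(score_fns, xs):
    """SUM: add the per-view classification scores p(y=1|x^k)."""
    return sum(f(x) for f, x in zip(score_fns, xs))

def fuse_product(score_fns, xs):
    """PRD: multiply the per-view classification scores."""
    return float(np.prod([f(x) for f, x in zip(score_fns, xs)]))
```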
Multiple Kernel Learning
Combine multiple kernels corresponding to different data sources (e.g. camera views) via a convex combination such as
$$K(x_i, x_j) = \sum_{k=1}^{K} \eta_k \, k_k(x_i^k, x_j^k) \quad (3)$$
where $\eta_k \geq 0$, $\sum_{k=1}^{K} \eta_k = 1$, and each kernel $k_k$ only uses a distinct set of features from a data source. A sketch of this combination follows below.
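A minimal sketch of Eq. (3) with one chi-square kernel per camera view; the weight symbol eta and the scikit-learn kernel call are assumptions consistent with the implementation details:

```python
# Convex combination of per-view kernels (Eq. 3).
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel

def combined_kernel(views_a, views_b, eta):
    """K = sum_k eta_k * chi2_kernel(X_a^k, X_b^k) over K camera views."""
    assert np.all(eta >= 0) and np.isclose(eta.sum(), 1.0)  # convexity constraint
    return sum(w * chi2_kernel(Xa, Xb) for w, Xa, Xb in zip(eta, views_a, views_b))
```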
Two-Step Optimisation
We need to learn the SVM parameters, i.e. the weights ($\alpha_i^k$) and bias ($b_k$) of each SVM model, and the combination parameters $\eta_k$ in (3). This can be solved as follows:
Step 1: optimise over the SVM parameters $\alpha_i^k$ and $b_k$ while fixing the combination parameters $\eta_k$ (standard SVM solver)
Step 2: optimise over the combination parameters $\eta_k$ while fixing the SVM parameters $\alpha_i^k$ and $b_k$ (gradient descent)
The procedure alternates between the two steps iteratively until the system converges to an optimal solution (see the sketch below).
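A hedged sketch of the alternating optimisation, in the style of SimpleMKL; the gradient form, step size, and simplex projection are simplified assumptions rather than the paper's exact update rule.

```python
# Two-step MKL: step 1 fits an SVM on the combined kernel; step 2 updates the
# kernel weights eta by projected gradient descent on the dual objective.
import numpy as np
from sklearn.svm import SVC

def two_step_mkl(kernels, y, n_iters=20, lr=0.1):
    """kernels: list of K precomputed (n, n) gram matrices, one per camera view."""
    eta = np.full(len(kernels), 1.0 / len(kernels))
    for _ in range(n_iters):
        K = sum(w * Km for w, Km in zip(eta, kernels))
        clf = SVC(kernel="precomputed").fit(K, y)   # step 1: fix eta, fit SVM
        alpha_y = clf.dual_coef_[0]                 # signed dual coefficients y_i * alpha_i
        sv = clf.support_
        # step 2: gradient of the dual objective w.r.t. each eta_k (up to scale)
        grad = np.array([-0.5 * alpha_y @ Km[np.ix_(sv, sv)] @ alpha_y
                         for Km in kernels])
        eta = np.maximum(eta - lr * grad, 1e-8)
        eta /= eta.sum()                            # project back onto the simplex
    return eta, clf
```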
IXMAS Multi-View Dataset
Created for view-invariant human action recognition [8]
Includes 13 daily actions, each performed 3 times by 12 actors
Video sequences collected via 5 cameras, at 23 frames per second and 390 × 291 resolution
We use all 12 actors and 5 cameras, and evaluate 11 actions as in [9]
Leave-one-subject-out cross-validation is used in the experiments (sketched below)
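A minimal sketch of the leave-one-subject-out protocol, assuming scikit-learn and a hypothetical actor_ids array giving each sample's subject; fit_predict is likewise a hypothetical callable wrapping training and prediction:

```python
# Leave-one-subject-out cross-validation: one fold per held-out actor.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loso_scores(fit_predict, X, y, actor_ids):
    """Return per-fold accuracies; no actor appears in both train and test."""
    accs = []
    for tr, te in LeaveOneGroupOut().split(X, y, groups=actor_ids):
        y_pred = fit_predict(X[tr], y[tr], X[te])  # train on 11 actors, test on 1
        accs.append(float(np.mean(y_pred == y[te])))
    return accs
```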
Implementation Details
A codebook of size 4000 is quantised from 100,000 randomly selected descriptor features of the training set
The STIP descriptor uses the entire image plane and the frame span of an action given in the ground truth to define a cuboid
The IDT descriptor relies on the person detections in addition to the frame span
All the SVM models use $\ell_1$ normalisation and the $\chi^2$ kernel (sketched below)
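For reference, a self-contained sketch of the chi-square kernel on l1-normalised histograms; the explicit gamma parameter and epsilon guard are assumptions (this mirrors scikit-learn's chi2_kernel behaviour):

```python
# Chi-square kernel: k(a, b) = exp(-gamma * sum_d (a_d - b_d)^2 / (a_d + b_d)).
import numpy as np

def chi2_kernel_manual(A, B, gamma=1.0):
    """Gram matrix between histogram rows of A (n, D) and B (m, D)."""
    eps = 1e-10                          # guard against empty bins
    d = A[:, None, :] - B[None, :, :]
    s = A[:, None, :] + B[None, :, :] + eps
    return np.exp(-gamma * (d * d / s).sum(axis=-1))
```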
Person Detection Results
Figure: Detection results of the motion-based tracker for the first run of the subject `Alba', across all five camera views (cam0, cam1, cam2, cam3, cam4).
Results of STIP (Internal Comparison)
Figure: Class-wise mean recognition rates over all folds of the compared methods using the STIP descriptor, for the classes check watch, sit down, cross arms, scratch head, get up, turn around, walk, wave, punch, kick and pick up; overall SVM-COM = 0.819, SVM-SUM = 0.820, SVM-PRD = 0.815, and SVM-MKL = 0.842.
Results of IDT (Internal Comparison)
Figure: Class-wise mean recognition rates over all folds of the compared methods using the IDT descriptor, for the same eleven classes; overall SVM-COM = 0.915, SVM-SUM = 0.927, SVM-PRD = 0.921, and SVM-MKL = 0.950.
Comparison with State-of-the-Art (External Comparison)
Method                  Actions  Actors  Views  Rate   FPS
Cilla et al. [3]        11       12      5      0.913  N/A
Weinland et al. [10]    11       10      5      0.933  N/A
Cilla et al. [4]        11       10      5      0.940  N/A
Holte et al. [5]        13       12      5      1.000  N/A
Weinland et al. [9]     11       10      5      0.835  500
Chaaraoui et al. [1]    11       12      5      0.859  26
Chaaraoui et al. [2]    11       12      5      0.914  207
SVM-MKL (IDT+BoWs)      11       12      5      0.950  25

Table: Comparison of the proposed MKL method using the IDT descriptor and BoWs; methods with `N/A' in the FPS column are offline.
Conclusions and Future Work
The proposed MKL based framework outperforms the simple fusion techniques and the state-of-the-art methods
The IDT descriptor is superior to the STIP descriptor for feature representation in action recognition
The proposed framework is capable of performing real-time action recognition at 25 frames per second
Future work: apply the framework to other similar vision problems, and study alternative feature representations and fusion techniques.
Thank you very much! Any questions?
References

[1] A. A. Chaaraoui, P. Climent-Perez, and F. Florez-Revuelta. Silhouette-based human action recognition using sequences of key poses. Pattern Recognition Letters, 34:1799-1807, 2013.
[2] A. A. Chaaraoui, J. R. Padilla-Lopez, F. J. Ferrandez-Pastor, M. Nieto-Hidalgo, and F. Florez-Revuelta. A vision-based system for intelligent monitoring: human behaviour analysis and privacy by context. Sensors, 14:8895-8925, 2014.
[3] R. Cilla, M. A. Patricio, and A. Berlanga. A probabilistic, discriminative and distributed system for the recognition of human actions from multiple views. Neurocomputing, 75:78-87, 2012.
[4] R. Cilla, M. A. Patricio, A. Berlanga, and J. M. Molina. Human action recognition with sparse classification and multiple-view learning. Expert Systems, DOI: 10.1111/exsy.12040, 2013.
[5] M. Holte, B. Chakraborty, J. Gonzalez, and T. Moeslund. A local 3-D motion descriptor for multi-view human action recognition from 4-D spatio-temporal interest points. IEEE Journal of Selected Topics in Signal Processing, 6:553-565, 2012.
[6] C. Stauffer and W. Grimson. Learning patterns of activity using real time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(8):747-767, 2000.
[7] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In British Machine Vision Conference (BMVC), 2009.
[8] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars. In IEEE International Conference on Computer Vision (ICCV), pages 1-7, 2007.
[9] D. Weinland, M. Ozuysal, and P. Fua. Making action recognition robust to occlusions and viewpoint changes. In European Conference on Computer Vision (ECCV), 2010.
[10] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 104(2-3):249-257, 2006.