Lecture 09, Rita Cucchiara - Egocentric Vision: Tracking and Recognizing Human Signs
1. ICVSS
MARINA DI RAGUSA JULY 2014
Prof. Rita Cucchiara
DIPARTIMENTO DI INGEGNERIA Enzo Ferrari
Università di Modena e Reggio Emilia, Italia
Egocentric vision
tracking and recognizing human signs
From fundamentals to applications
http://www.Imagelab.ing.unimore.it
2. Rita Cucchiara ICVSS 2014, Italy
AGENDA
Egocentric vision: from applications to fundamentals (and vice versa)
• Introduction
• Challenges in ego-vision problems
• Recognizing ego-gestures by motion
• The (unsolved) tracking problem, in ego-vision too
• Discussion at Ragusa Ibla
1. INTRODUCTION
Can we know what you are looking at?
EGOCENTRIC VISION
Egocentric vision ("ego-vision"): models and techniques for understanding what a person sees, from the first-person point of view and centered on human perceptual needs.
Often called first-person vision, recalling the use of wearable cameras (e.g. mounted on glasses on the head) to acquire and process the same visual stimuli that humans acquire and process.
In a broader meaning: to understand what a person sees, wants to see, or would like to see (e.g. in the case of visual impairment), and to exploit learning, perception and reasoning paradigms similar to those of humans.
RESEARCH @IMAGELAB
from surveillance…
to new vision sensors
Floorimage
Drones
Smartphones Ego-vision
A SMALL, INCOMPLETE STORY…
1961: Edward O. Thorp (with Shannon) built a computerized timing device for cheating at the game of roulette (from [Thorp ISWC 98]).
1980s to now: Steve Mann (now at the University of Toronto) defined many concepts of wearable computing for vision.
"My current wearable prototype, equipped with head-mounted display, cameras, and wireless communications, enables computer-assisted forms of interaction in ordinary situations - for example, while walking, shopping, or meeting people - and it is hardly noticeable." [Mann Computer 1997]
1998 to now: the MIT wearable computing lab (Alex Pentland, Bernt Schiele, …)
A SMALL, INCOMPLETE STORY (cont.)
2000: MIThril, MIT.
2004: Richard DeVaul, PhD thesis on the Memory Glasses (supervisor A. Pentland) [DeVaul MIT 04].
2009: 1st CVPR workshop on Egocentric Vision, by Philipose, Hebert, Ren.
2009: T. Kanade, "First-person, inside-out vision," keynote.
2012: 2nd CVPR workshop on Egocentric Vision, by Rehg, Ramanan, Ren, Fathi, Pirsiavash. Google Glass.
2014: 3rd CVPR workshop on Egocentric Vision, by Kitani, Lee, Ryoo, Fathi; three papers at CVPR 2014; a session at IWCV 2014.
Here we are.
Google temporarily banned facial recognition technology on Google Glass due to privacy concerns.
APPLICATIONS (OR MARKET NEEDS?)
Life logging: recording your life digitally.
• Memex, 1945, Vannevar Bush
• BigBrother, 1997
• MyLifeBits, Gordon Bell, Microsoft, 2000
• SenseCam, Microsoft
• Sony Core
• Google Glass
Life caching, lifeblogging…
Human augmentation applications: creating cognitive and physical improvements as an integral part of the human body (Gartner IT Glossary, 2012).
LIFE LOGGING: MARKET
A megabit for every second in a year: roughly 10 million seconds per year, so about 10 Tbit is needed to store everything I look at for a year.
Memoto: two days of memory with two shots per minute; 4 GB of data per day amounts to up to 1.5 terabytes per year, which you can store on its cloud.
Autographer: a camera capable of shooting up to 2,000 shots a day.
GoPro (see @youtube).
The lifelogger camera driving design requirements:
1. compact and small enough to wear like a necklace
2. wide field of view, to capture what I see
3. battery life lasting a full day
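As a back-of-envelope check of the storage figures above (the constants are the slide's own; the exact seconds-per-year count is mine):

```python
# Lifelogging storage, back of the envelope.
seconds_per_year = 365 * 24 * 3600            # exactly 31,536,000; the slide rounds to ~10 million
bits_per_year = 1e6 * seconds_per_year        # at 1 Mbit for every second
print(f"{bits_per_year / 1e12:.0f} Tbit/year at 1 Mbit/s")

# Memoto-style device: 4 GB per day.
gb_per_year = 4 * 365
print(f"{gb_per_year / 1000:.2f} TB/year at 4 GB/day")   # close to the ~1.5 TB claim
```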
AUGMENTED VISION
Who is who?
Augmented vision for:
• security and surveillance
• smart manufacturing
• natural interfaces, HCI 2.0
• wellbeing
• education
• entertainment
• …
AUGMENTED VISION
Applications and market needs: MANY.
Applications for reading, for visually impaired and blind people:
• OrCam (Shashua 2010, Hebrew University spin-off)
Applications for smart manufacturing:
• Viewranger (Japan)
Applications for security and surveillance.
2. CV CHALLENGES FOR EGOVISION
Why is egocentric vision not so trivial?
CV CHALLENGES FOR EGOVISION (1/6)
a. Hardware
• Design new hardware [Badino, Kanade MVA 2011]
• Exploit real-time capabilities for egovision
At UNIMORE & ETHZ: Odroid XU, ARM Exynos 5 heterogeneous octa-core.
Devices: Google Glass, Vuzix M100, Kopin Golden-i, Olympus MEG4.0
CV CHALLENGES FOR EGOVISION (2/6)
b. Recognizing FoA (focus of attention) and PoI (points of interest)
• Eye tracking & ego-vision [Tsukada ETRA 2012]
• Estimating FoA [Li ICCV 2013], [Ogaki CVPRW 2012], [Fathi ECCV 2012]
A. Fathi, Y. Li, J. Rehg, Learning to Recognize Daily Actions Using Gaze, ECCV 2012
CV CHALLENGES FOR EGOVISION (3/6)
c. Recognizing head motion
• Head/body motion for outdoor summarization [Poleg CVPR 2014]
• Motion for indoor summarization [Grauman CVPR 2013]
• Motion for supporting attention [Matsuo CVPRW 2014]
• Motion for SLAM, as in robotics [Bahera ACCV 2012]
Cumulative displacement curves: Poleg, Arora, Peleg, Temporal Segmentation of Egocentric Videos, CVPR 2014
UNDERSTANDING MOTION
Dense OF vs. sparse OF: the classical dense OF approach vs. the classical Lucas-Kanade (LK) approach.

Image brightness constancy assumption:

    E_x (dx/dt) + E_y (dy/dt) + E_t = 0,   i.e.   (grad E)^T v + E_t = 0

For each pixel p_i in a small window (e.g. 5x5 = 25 pixels), LK stacks one such equation:

    E_x(p_i) u + E_y(p_i) v = -E_t(p_i),   i = 1, …, 25

giving an over-determined linear system A d = b, with A (25x2) built from the spatial gradients, d = (u, v)^T, and b the negated temporal derivatives. It is solved through the normal equations:

    (A^T A) d = A^T b   =>   d = (A^T A)^(-1) A^T b,

where A^T A = [[Sum E_x E_x, Sum E_x E_y], [Sum E_x E_y, Sum E_y E_y]]. The solution is reliable only if A^T A is well conditioned: the eigenvalue ratio lambda_1/lambda_2 should not be large (lambda_1 = larger eigenvalue).

Iterative Lucas-Kanade algorithm [1981].

Farneback OF [2003]: faster; assumes that a local neighborhood of each pixel in an image can be represented on a polynomial basis.
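As a minimal sketch (not the deck's code), the LK normal equations can be solved for a single window with NumPy; the synthetic image and window size below are my own choices:

```python
import numpy as np

def lucas_kanade_window(I0, I1):
    """Estimate one translation (u, v) for a window by solving the
    normal equations (A^T A) d = A^T b from the brightness constancy
    equation Ex*u + Ey*v + Et = 0."""
    Ex = np.gradient(I0, axis=1)                     # spatial gradients
    Ey = np.gradient(I0, axis=0)
    Et = I1 - I0                                     # temporal derivative
    A = np.stack([Ex.ravel(), Ey.ravel()], axis=1)   # (N, 2)
    b = -Et.ravel()
    ATA = A.T @ A
    # A^T A must be well conditioned (texture in both directions),
    # as the eigenvalue test on the slide requires.
    return np.linalg.solve(ATA, A.T @ b)             # d = (u, v)

# Synthetic check: a textured image whose content moves +1 px in x.
x = np.arange(32, dtype=float)
y = np.arange(32, dtype=float)
I0 = np.sin(0.3 * x)[None, :] + np.sin(0.3 * y)[:, None]
I1 = np.sin(0.3 * (x - 1.0))[None, :] + np.sin(0.3 * y)[:, None]
u, v = lucas_kanade_window(I0, I1)
print(u, v)   # close to (1, 0)
```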
COMPUTING MOTION
Optical flow.
Dense optical flow:
• Horn-Schunck [AI 1981]
• Farneback [SCIA 2003]
• Liu, Chellappa, Rosenfeld [ICPR 2002] (eigenvalues)
• Medioni [PAMI 2008] (tensor voting)
• …
Sparse optical flow:
• Lucas-Kanade [IJCAI 1981]
• other keypoints (SIFT, SURF, …)
Optical flow alone is not enough for understanding head motion.
CV CHALLENGES FOR EGOVISION (4/6)
d. Recognizing objects
• Objects useful for humans [Fathi CVPR 2013]
• Objects in the hand [Fathi, Rehg CVPR 2011]
• Target tagging in the scene [Pirsiavash, Ramanan CVPR 2012]
• Objects around you [Yvashita et al. ICRA 2014]
CV CHALLENGES FOR EGOVISION (5/6)
e. Recognizing actions
• Self-actions, gestures [Kitani CVPR 2013; Baraldi EVW 2014]
• Actions of people, social actions [Ryoo CVPR 2013], [Alletto EFPVW 2014], [Narayan CVPRW 2014]
• Actions in the environment (sport, …) [Kitani IEEE PC Magazine 2012]
EGO-GESTURE RECOGNITION
• Monocular hand-gesture recognition
• Deals with both static gestures (hand pose) and dynamic gestures (motion)
• Very few positive samples
• Many changes in luminance
HAND SEGMENTATION IN EGOVISION
Hand recognition is an old problem, with many approaches in different contexts:
• Skin classification [Khan et al. ICIP 2010]: random forests (better than BN, MLP, NB, AdaBoost, …)
• Background subtraction after image registration [Fathi ICCV 2011] (assuming static background, hands holding objects, etc.)
• Generic object recognition [Li, Kitani CVPR 2013]: sparse feature selection and a battery of RFs trained under different luminance conditions
AN EGOVISION SOLUTION
Pipeline: superpixel segmentation -> superpixel descriptors -> classification by a collection of RFs -> temporal coherence -> spatial coherence.
Superpixel segmentation:
• SLIC (Simple Linear Iterative Clustering [Achanta 2010])
• k-means in 5D (Lab + xy)
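The 5-D k-means idea can be sketched as plain k-means on (L, a, b, x, y) features; note that this omits SLIC's local 2S x 2S window search, which is what makes the real algorithm fast. The weighting factor m and toy image are my own:

```python
import numpy as np

def slic_like(image_lab, n_clusters=2, m=0.5, iters=5):
    """k-means in 5D (L, a, b, m*x, m*y): the core of SLIC, without
    the localized search of [Achanta 2010]. m trades off spatial vs.
    color distance."""
    H, W, _ = image_lab.shape
    ys, xs = np.mgrid[0:H, 0:W]
    feats = np.concatenate(
        [image_lab.reshape(-1, 3).astype(float),
         m * np.stack([xs.ravel(), ys.ravel()], axis=1)], axis=1)
    centers = feats[[0, -1]].copy()          # deterministic init: two corners
    centers = np.vstack([centers] + [feats[[i]] for i in []])[:n_clusters]
    for _ in range(iters):
        d = ((feats[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(0)
    return labels.reshape(H, W)

img = np.zeros((16, 16, 3))
img[:, 8:] = 50.0                            # two flat color regions
labels = slic_like(img)
```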
AN EGOVISION SOLUTION
Superpixel descriptors:
• mean and covariance in RGB
• Lab histogram and HSV histogram
• 27 Gabor filters (9 orientations, 3 scales: 7x7, 13x13, 19x19)
• HOG
AN EGOVISION SOLUTION
Classification:
• a collection of random forests, indexed by a 32-bin RGB histogram of the frame (a global luminance feature)
• the index encodes the appearance of the scene and the global luminance
• hypothesis: background and hands undergo similar color changes
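The "collection of classifiers indexed by global luminance" idea can be sketched as one model per luminance bin, selected at test time from the frame's global luminance. A nearest-centroid model stands in here for the random forests of the actual system; all numbers are illustrative:

```python
import numpy as np

class LuminanceIndexedClassifiers:
    """One classifier per global-luminance bin; the bin of the current
    frame selects which model to apply (sketch of the slide's idea)."""
    def __init__(self, n_bins=4):
        self.n_bins = n_bins
        self.models = {}                      # bin index -> {class: centroid}

    def _bin(self, luminance):                # luminance in [0, 1]
        return min(int(luminance * self.n_bins), self.n_bins - 1)

    def fit_frame(self, luminance, X, y):
        b = self._bin(luminance)
        self.models[b] = {c: X[y == c].mean(axis=0) for c in np.unique(y)}

    def predict(self, luminance, X):
        cents = self.models[self._bin(luminance)]
        classes = list(cents)
        d = np.stack([np.linalg.norm(X - cents[c], axis=1) for c in classes])
        return np.array([classes[i] for i in d.argmin(axis=0)])

clf = LuminanceIndexedClassifiers(n_bins=2)
# Hand (class 1) vs. background (class 0) colors shift with luminance.
clf.fit_frame(0.2, np.array([[0.4, 0.2, 0.2], [0.1, 0.1, 0.1]]), np.array([1, 0]))
clf.fit_frame(0.8, np.array([[0.9, 0.6, 0.5], [0.7, 0.7, 0.7]]), np.array([1, 0]))
```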
AN EGOVISION SOLUTION
Temporal coherence: temporal smoothing in a window of k frames; the posterior probability of being (or not being) a hand pixel in the previous window is used as an estimated prior.
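A minimal sketch of windowed temporal smoothing of per-pixel posteriors (the window size and toy maps are mine; the real system uses the smoothed posterior as a prior for the classifier):

```python
import numpy as np
from collections import deque

def smooth_posteriors(frames_posterior, k=3):
    """Average each pixel's hand posterior over a sliding window of
    the last k frames, damping per-frame classifier flicker."""
    window = deque(maxlen=k)
    out = []
    for p in frames_posterior:
        window.append(p)
        out.append(np.mean(list(window), axis=0))
    return out

# A pixel whose raw posterior flickers 1, 0, 1 is damped to 2/3.
noisy = [np.array([[1.0]]), np.array([[0.0]]), np.array([[1.0]])]
smoothed = smooth_posteriors(noisy, k=3)
```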
AN EGOVISION SOLUTION
Spatial coherence (spatial consistency):
• eliminate spurious superpixels
• close holes
• run GrabCut using the posterior probability map as seed
HAND SEGMENTATION
On standard datasets, there is a significant improvement in performance when all three consistency aspects are used together:
• illumination invariance (II)
• temporal smoothing (TS)
• spatial consistency (SC)
CAMERA MOTION REMOVAL
• Extract dense keypoints from consecutive frames and match them.
• Estimate the homography between the frames.
• Apply it to the original frame sequence, obtaining an output sequence without camera motion.
Notes:
• In ego-camera views, hand movement is usually not consistent with camera motion, resulting in wrong matches between the two frames.
• A segmentation mask disregards feature matches belonging to hands.
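The homography-estimation step can be sketched with the Direct Linear Transform (DLT) on point correspondences; this omits RANSAC and the hand-mask filtering described above, and the synthetic points are my own:

```python
import numpy as np

def estimate_homography(src, dst):
    """DLT: estimate the 3x3 homography H mapping src -> dst from at
    least 4 correspondences, via the SVD null space of the stacked
    linear constraints."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

# A pure translation by (5, -3) should yield [[1,0,5],[0,1,-3],[0,0,1]].
src = [(0, 0), (10, 0), (0, 10), (10, 10)]
dst = [(x + 5, y - 3) for x, y in src]
H = estimate_homography(src, dst)
```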
EXPERIMENTAL RESULTS
Datasets:
• The Cambridge-Gesture database, with 900 sequences of nine hand-gesture types under different illumination conditions;
• our Interactive Museum dataset, an egocentric gesture-recognition dataset with 700 sequences of seven gesture classes performed by five subjects;
• the EDSH dataset, which consists of three egocentric videos with indoor and outdoor scenes and large variations of illumination.
See results in [Baraldi CVPRW 2014].
CV CHALLENGES FOR EGOVISION (6/6)
f. Tracking: recognizing across time
• tracking target objects
• tracking faces and people [Alletto ICPR 2014]
• multiple-target tracking
TRACKING
A 2x2 taxonomy (single vs. multiple targets, single vs. multiple fields of view):
• Single object from a single camera: static camera, moving camera; smartphone, egovision, …
• Multiple objects from a single camera
• Single object from multiple cameras: overlapped or disjoint FoVs; networks of egovision systems; heterogeneous camera networks
• Multiple objects from distributed cameras
See imagelab.ing.unimore.it
TRACKING IN EGOVISION
Straightforward relationships with robot vision!
Similar to video surveillance:
• fast, real-time
• similar scenes (typically people, social life, children, …)
• many similar tasks (detection, action recognition, tracking)
• as in people tracking, motion is unpredictable
but:
• unconstrained
• largely different motion factors
• frequent changes of field of view
• interactive
HOW DO WE SEE?
The path (E. Kandel, 2008):
1) The stimuli from the retinae, through two parallel paths, reach the lateral geniculate nucleus in the thalamus, then the cortex in the occipital lobe, and then the temporal and frontal lobes.
2) Two parallel paths:
   - the WHAT pathway, in the temporal lobe, perceives color, the shape of the object, the face, …
   - the WHERE pathway, in the parietal lobe, provides localization of such objects.
3) The centers are hierarchically connected; they process information and then come back to the WHAT area and work together, based on attention and purpose.
SINGLE TARGET TRACKING
Tracking is the task of generating an inference about the motion of an object given a sequence of images.*
Here we analyze the behavior of current tracking solutions in following a given target frame-by-frame.
* Arnold Smeulders, Dung M. Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and Mubarak Shah, Visual Tracking: an Experimental Survey, IEEE TPAMI, 2014.
SINGLE TARGET TRACKING
The (unsolved) questions in tracking a given target:
1) Region of interest: the data, or the observation
2) Representation: the state (the object model):
   2.1) how to observe invariant and variant features in the frame, and
   2.2) how to hold them in an internal representation
3) Inference method: the inference
4) Model update
THE HARDNESS OF TRACKING
What is the invariance that can be perceived and maintained over time?
Tracking is hard because nothing is fixed:
• problems of light: the target aspect, the illumination;
• problems of motion: the object/camera motion;
• problems of scene: occlusion, confusion, …
… searching for the invariance in the video.
Despite the variety in video, most papers use only 6-10 long videos. This covers variety poorly.
WHY IS TRACKING SO HARD?
14 challenges in video tracking:
• Light: 1. light; 10. scene contrast
• Object aspect: 2. object surface cover; 3. object specularity; 4. object transparency; 5. object shape
• Object motion: 6. motion smoothness; 7. motion coherence
• Scene context: 8. scene clutter; 9. scene confusion; 11. scene occlusion
• Camera motion: 12. camera moving; 13. camera zoom
• Temporal coherence: 14. long videos
TRACKING IN EGO-VISION
In egovision?
• all the previous problems!
• relative motion:
  - no motion of the observer, but a moving target
  - motion of the observer, but a fixed target
  - motion of both observer and target
• datasets at imagelab.ing.unimore.it: EGO_GROUP, EGO_TRACK
GENERAL-PURPOSE TRACKING
Single-target tracking, without any constraints.
From Kalman, extended Kalman, KLT, particle filter, mean shift, … to new generations of tracking solutions that:
• explore multiple features and cues (SIFT, HOG, SURF, etc.)
• explore multiple object representations (fragments, graphs)
• explore new optimization methods
• explore new machine-learning solutions
But what about the results?
THE STATE OF THE ART
19 tracking solutions for you, in two families: tracking by matching, and tracking with a discriminative classifier.
THE STATE OF THE ART
Tracking by matching:
• [NCC] Normalized Cross-Correlation (K. Briechle and U. Hanebeck, SPIE 2001)
• [KLT] Lucas-Kanade Tracker (S. Baker and I. Matthews, IJCV 2004)
• [KAT] Kalman Appearance Tracker (H. Nguyen and A. Smeulders, TPAMI 2004)
• [FRT] Fragments-based Robust Tracking (A. Adam, E. Rivlin, and I. Shimshoni, CVPR 2006)
• [MST] Mean Shift Tracking (D. Comaniciu, V. Ramesh, and P. Meer, CVPR 2000)
• [LOT] Locally Orderless Tracking (S. Oron, A. Bar-Hillel, D. Levi, S. Avidan, CVPR 2012)
THE STATE OF THE ART
Tracking by matching (cont.):
• [IVT] Incremental Visual Tracking (D. Ross, J. Lim, and R.-S. Lin, IJCV 2008)
• [TAG] Tracking on the Affine Group (J. Kwon and F.C. Park, CVPR 2009)
• [TST] Tracking by Sampling Trackers (J. Kwon, K.M. Lee, ICCV 2011)
• [TMC] Tracking by Monte Carlo sampling (J. Kwon, K.M. Lee, CVPR 2009)
• [ACT] Adaptive Coupled-layer Tracking (L. Cehovin, M. Kristan, A. Leonardis, ICCV 2011)
• [L1T] L1-minimization tracker (X. Mei, H. Ling, ICCV 2009)
• [L1O] L1 minimization with occlusion (X. Mei, H. Ling, Y. Wu, E. Blasch, L. Bai, CVPR 2011)
THE STATE OF THE ART
Tracking with a discriminative classifier:
• [FBT] Foreground-Background Tracking (Nguyen, Smeulders, IJCV 2006)
• [HBT] Hough-Based Tracking (Godec, Roth, Bischof, ICCV 2011)
• [SPT] Superpixel Tracking (Wang, Lu, Yang, Yang, ICCV 2011)
• [MIT] Multiple Instance Learning (Babenko, Yang, Belongie, CVPR 2009)
• [TLD] Tracking-Learning-Detection (Kalal, Matas, Mikolajczyk, CVPR 2010)
• [STR] Structured Output Tracking (Hare, Saffari, Torr, ICCV 2011)
ONE PROBLEM, MANY SOLUTIONS…
The ingredients of a tracker: the RoI and the visual features (the data), the appearance and motion models, predictions and inference, and the model update.
REGION OF INTEREST
1. From manual or automatic detectors
2. From moving-object segmentation (background suppression, OF segmentation)
3. From local feature identification
APPEARANCE
b) Histograms
• color histograms (MST, TMC, HBT, SPT)
• intensity histograms (FRT, ACT)
Useful only in small patches; otherwise the spatial-relationship information has to be captured elsewhere in the tracking algorithm.
c) Feature vectors
Useful if the shape of the object is important (and constant, at least in some parts):
• Haar gradients (MIT)
• 2-bit binary patterns (TLD)
• SURF (FBT)
• Lab-color features and others (HBT)
Be careful to select true and stable invariants.
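As a small sketch of the histogram appearance models used by these trackers, a normalized per-channel color histogram can be compared with the Bhattacharyya coefficient (the mean-shift family's similarity); bin count and toy patches are my own:

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Concatenated per-channel intensity histograms, normalized to
    sum to 1, as a simple appearance model."""
    hists = [np.histogram(patch[..., c], bins=bins, range=(0, 1))[0]
             for c in range(patch.shape[-1])]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def bhattacharyya(h1, h2):
    """Similarity in [0, 1]; 1 means identical histograms."""
    return float(np.sqrt(h1 * h2).sum())

rng = np.random.default_rng(0)
target = rng.uniform(0.6, 0.9, size=(16, 16, 3))      # bright-ish patch
same = rng.uniform(0.6, 0.9, size=(16, 16, 3))        # similar appearance
different = rng.uniform(0.0, 0.3, size=(16, 16, 3))   # dark patch
h_t, h_s, h_d = map(color_histogram, (target, same, different))
```

A matching tracker would pick, among candidate windows, the one whose histogram maximizes this similarity to the target model.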
APPEARANCE
Some trackers keep the appearance information of the scene, i.e. add contextual information:
• background intensity representation (the methods based on background subtraction keep a reference background), as in surveillance ([Wren et al. TPAMI 97], [MoG CVPR 99], [Sakbot TPAMI 03]);
• occlusion-detection information [Ad-Hoc PRL 11, ALIEN 13];
• confusion information [Medioni CVPR 11];
• camera-motion information [Qiogui IASP 11].
They cannot always be used in general, unknown visual contexts.
MOTION MODEL
1. Uniform search (no motion model), e.g. STR, FBT
2. Probabilistic Gaussian motion model, e.g. IVT, L1T
3. Motion prediction, e.g. Kalman, ACT (a linear motion model, sometimes guided by OF)
4. Implicit motion model (with optimization), e.g. KLT, MST
5. Multiple models: tracking and detection (TLD), particle filters (SPT), 3D affine or projective motion models (TAG)
MOTION
Some considerations:
• uniform search is simple, but better if guided at least by optical flow, especially in egovision;
• implicit motion with optimization works only if the motion is small and the appearance fairly constant, which is not always possible;
• motion prediction is excellent in specific applications: for instance, in intelligent transportation systems a linear motion prediction is hardly questionable.
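The linear motion prediction mentioned above can be sketched as a constant-velocity Kalman filter in 1-D; all noise values and measurements below are toy numbers of my own:

```python
import numpy as np

# Constant-velocity Kalman filter for a 1-D position track.
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition (position, velocity)
Hm = np.array([[1.0, 0.0]])             # we only observe position
Q = 1e-4 * np.eye(2)                    # process noise
R = np.array([[1e-2]])                  # measurement noise

x = np.array([[0.0], [0.0]])            # initial state
P = np.eye(2)                           # initial uncertainty

for z in [1.0, 2.0, 3.0, 4.0]:          # target moving at ~1 unit/frame
    # Predict.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the measurement z.
    K = P @ Hm.T @ np.linalg.inv(Hm @ P @ Hm.T + R)
    x = x + K @ (np.array([[z]]) - Hm @ x)
    P = (np.eye(2) - K @ Hm) @ P

predicted_next = (F @ x)[0, 0]          # where the tracker searches next (~5)
```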
MODEL UPDATE
Model updating strategies:
• No update (NCC, FRT, MST, KLT): no update, but search for the best transform of the model; it may seem trivial, but it is good for short sequences.
• Last-seen template, or partial updating (Porikli CVPR 06).
• Predicting the new appearance (KAT): handles long-term occlusion, but abrupt changes can produce errors.
• Patch updates (TMC and ACT): add or delete patches.
• Updating an extended model, e.g. incremental PCA (IVT, TAG, TST).
INFERENCE METHODS
The method: the computational paradigm used to find the best location and/or the best state of the target in the new frame. Let T be the target model (the state) and C the candidate objects (the data).
a) Matching
b) Matching with an extended target
c) Matching with constraints
d) Discriminative
e) Discriminative with constraints
INFERENCE METHODS
a) Matching:
• direct gradient ascent (NCC, KLT), or probabilistic matching (MST);
• matching with particle filtering (IVT, TST, TAG).
Useful if the appearances of target and background differ, to avoid local maxima; useful in case of good appearance invariance (problems with intensity or surface-luminance changes); good for occlusions and low contrast.
b) Matching with extended appearance:
• subspace matching in extended models, with stored examples (IVT), or with different track results (TAG, TST).
Similar to having a long-term memory; useful for long-term tracking or with occlusions; more complex.
c) Matching with constraints:
• adding rules for the context (TAG), for the positions of patches (L1T, L1O), or for their pose (ACT).
INFERENCE METHODS
d) Discriminative: a supervised discriminative classifier:
• FBT: linear discriminant analysis;
• HBT: segmentation and random forests;
• MIT and SPT: clustering;
• TLD: a pool of randomized classifiers.
Often very few examples are available (thus LDA can be better than multiple-instance learning); in case of errors there are problems of drifting away.
e) Discriminative with constraints:
• structured classifier: STR uses as output the displacement of the target instead of a per-pixel label.
[NCC] NORMALIZED CROSS-CORRELATION
Direct target matching by normalized cross-correlation [Briechle et al. SPIE 2001].
• Intensity values in the initial target box are the template t;
• matching by sampling uniformly around the previous position;
• take the highest NCC score at pixel level;
• no updating of the target.

    N(m, n) = Sum_{i,j} g(i+m, j+n) t(i, j) / sqrt( Sum_{i,j} g(i+m, j+n)^2 * Sum_{i,j} t(i, j)^2 )
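A minimal sketch of NCC tracking following this formula (exhaustive search in a window; the synthetic frame and search radius are my own choices):

```python
import numpy as np

def ncc(patch, template):
    """Normalized cross-correlation score, as in the N(m, n) formula."""
    num = (patch * template).sum()
    den = np.sqrt((patch ** 2).sum() * (template ** 2).sum())
    return num / den

def ncc_track(frame, template, center, radius):
    """Exhaustive search around the previous position; return the
    top-left corner with the highest NCC score."""
    th, tw = template.shape
    best, best_pos = -np.inf, center
    cy, cx = center
    for y in range(max(0, cy - radius), min(frame.shape[0] - th, cy + radius) + 1):
        for x in range(max(0, cx - radius), min(frame.shape[1] - tw, cx + radius) + 1):
            s = ncc(frame[y:y + th, x:x + tw], template)
            if s > best:
                best, best_pos = s, (y, x)
    return best_pos

rng = np.random.default_rng(1)
frame = rng.uniform(size=(40, 40))
template = frame[10:18, 20:28].copy()        # the target box
found = ncc_track(frame, template, center=(12, 18), radius=6)
```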
[FRT] FRAGMENTS-BASED ROBUST TRACKING
Matching an ensemble of 10 x 2 patches [Adam et al. CVPR 2006].
• The RoI is divided into patches;
• a new window is searched around the previous one, including a 10% scale change;
• each patch's intensity histogram is compared by the Earth Mover's Distance;
• the target is not updated.
Robust to pose changes and occlusions, and very simple; suitable for shape modification.
[HBT] HOUGH-BASED TRACKING
Discriminative classifier on Lab color, gradients and positions [Godec, Roth, Bischof, ICCV 2011]. The Hough forest provides a probability map of the target.
• Given the sample, represented by features;
• transform the target with a Hough forest;
• back-project from the Hough forest (as in the GHT's R-table);
• segment the target using GrabCut, and hence generate new samples.
[TLD] TRACKING, LEARNING AND DETECTION
Top detections on binary patterns and KLT optical flow are combined by NCC; fast [Kalal, Matas, Mikolajczyk, CVPR 2010].
• Samples are selected in, around and away from the target to update the classifier (labeled and unlabeled).
• If neither of the two trackers produces an output, TLD declares a loss and recovers.
• It learns which is the best detector; good for short-term occlusion.
[STR] STRUCTURED OUTPUT TRACKING
Structured supervised classifier on {appearance, translation} [Hare, Saffari, Torr, ICCV 2011].
• The window is described by Haar features at 2 scales.
• Sampling is uniform around the previous position.
• The S-SVM learner updates the constraint to stay at the current location; the locations which violate the support points are used for the new SVM.
IS TRACKING GOOD ENOUGH?
Measuring results is hard
TRACKING MEASURES
Let GT_i be the ground truth in frame i and T_i the detected target in frame i.

Match degree at pixel level:

    MD = |T_i ∩ GT_i| / |T_i ∪ GT_i|

A frame is a match if MD >= Th, with Th = 0.5 (the PASCAL measure). Without the threshold, the overlap itself is used (called the DICE measure here).
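The overlap measure is straightforward to compute for axis-aligned boxes; a small sketch (box format (x1, y1, x2, y2) is my own convention):

```python
def overlap(box_t, box_gt):
    """|T ∩ GT| / |T ∪ GT| for axis-aligned boxes (x1, y1, x2, y2);
    thresholding at 0.5 gives the PASCAL criterion."""
    ix1, iy1 = max(box_t[0], box_gt[0]), max(box_t[1], box_gt[1])
    ix2, iy2 = min(box_t[2], box_gt[2]), min(box_t[3], box_gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_t) + area(box_gt) - inter
    return inter / union

md = overlap((0, 0, 10, 10), (5, 0, 15, 10))   # half-shifted boxes -> 1/3
```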
TRACKING MEASURES
At object level, over a sequence of frames (i = 1, …, N_frame):

    n_tp = Sum_i n_tp^i,   n_fp = Sum_i n_fp^i,   n_fn = Sum_i n_fn^i

    Precision = n_tp / (n_tp + n_fp),   Recall = n_tp / (n_tp + n_fn)

    F = 2 * Precision * Recall / (Precision + Recall)   (also called correct track ratio)

At area/pixel level, per frame:

    p_i = |T_i ∩ GT_i| / |T_i|,   r_i = |T_i ∩ GT_i| / |GT_i|

    F1 = (1 / N_frame) * Sum_i 2 * p_i * r_i / (p_i + r_i)
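The object-level measure can be sketched directly from per-frame counts (the three-frame example is my own):

```python
def track_scores(per_frame_counts):
    """Sequence-level precision, recall and F-score from per-frame
    (tp, fp, fn) counts."""
    ntp = sum(c[0] for c in per_frame_counts)
    nfp = sum(c[1] for c in per_frame_counts)
    nfn = sum(c[2] for c in per_frame_counts)
    precision = ntp / (ntp + nfp)
    recall = ntp / (ntp + nfn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# 3 frames: target found, found with an extra false alarm, missed.
p, r, f = track_scores([(1, 0, 0), (1, 1, 0), (0, 0, 1)])   # all 2/3
```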
TRACKING MEASURES
For measuring the position deviation instead:

    Deviation = 1 - ( Sum_{i in M} d(C_{T_i}, C_{GT_i}) ) / |M|

where d(x, y) is the L2 distance between the centroids and M is the set of matched frames.

PBM (Position-Based Matching):

    PBM = (1 / N_frames) * Sum_i ( 1 - d1(T_i, GT_i) / sp_ave(i) )

where d1 is the L1 norm and sp_ave(i) = (H(T) + W(T) + H(GT) + W(GT)) / 2 is the average semi-perimeter of T and GT.
EXPERIMENTAL RESULTS ON ALOV++
Survival curves (Kaplan-Meier) for [NCC], [STR], [L1O], [TST], [TLD], [FBT].
• The upper bound, taking the best of all trackers at each frame: 10%.
• The lower bound, what all trackers can do: 7%.
• Only about 30% is correctly tracked.
COMPARISON ON MATCHING MODELS
For egovision: comparison on videos with motion issues (FRT, KAT, KLT, LOT, MST, NCC).
• FRT, using fragments, copes well with motion changes.
• NCC is insensitive to motion changes.
CONFUSION CHALLENGE
[FBT] [NCC] [STR] [TLD] [TST] [L1O]
Confusion, crowd: short-term tracking.
OCCLUSION
Conclusions:
1. STR, FBT, TST, TLD and L1T are best here (!).
2. Light occlusion is approximately solved.
3. Full occlusion is still hard for most.
LONG TERM CHALLENGE:
[FBT][NCC][STR] [TLD][TST] [L1O]
IS TRACKING GOOD IN EGOVISION?
Problems in egovision:
• moving head and moving target
• changes of FoV
• changes of luminance
Dataset:
V1. (semi-)still observer
V2. moving head, not coherent motion (abrupt changes in motion patterns)
V3. camera-observer movement, with and without abrupt camera motions
Trackers:
• matching-based: NCC, NN (with color histograms), FRT
• discriminative-classifier-based: HBT, TLD, STR
TRACKING IN EGOVISION: EVALUATION
V1: (semi-)still camera
Challenges:
• changes in object shape (e.g. change in head pose)
• occlusions between objects
Pros:
• no camera motion: low blur; target losses can only be due to occlusions
• adaptive models can adapt to changes in object shape
Cons:
• occlusions are likely to occur, so loss detection is needed
• adaptive models must detect the loss, or they adapt to the occluding object
TRACKING IN EGOVISION: EVALUATION
Results in the first scenario (DICE measure: the overlap degree between the ground truth and the predicted bounding box).
V1.1: video without occlusions; the only challenges are the subject's pose changes.
V1.2: recurring occlusions; adaptive models (STR, TLD, HBT) fail to discriminate between the original target and the occluding one, adapt to it, and thus fail.
TRACKING IN EGOVISION: EVALUATION
V2: moving camera, still person
Challenges:
• changes in object shape (e.g. change in head pose)
• target exits the camera FoV
• occlusions between objects
Pros:
• the person stands still; abrupt lighting changes are not likely
Cons:
• occlusions are likely to occur, so loss detection is needed
• the target can exit the camera FoV, so loss detection and re-identification are needed
• adaptive models without loss detection quickly adapt to the background after a loss
TRACKING IN EGOVISION: EVALUATION
Tracking results in the second scenario:
V2.1: video with people chatting. HBT performs poorly due to its lack of loss detection and recovery. STR cannot detect the loss either, and adapts its support vectors to the background.
V2.2: tracking of an environmental point of interest. The target stays still but gets occluded and exits the camera FoV. Color-based trackers (HBT, NN) perform poorly due to the difficulty of discriminating the object based on color.
V2.3: tracking a face under fast-occurring occlusions. Responsive loss detection (TLD) is needed in order to stop adapting the model in time. Scale changes compromise model matching (FRT, NCC).
TRACKING IN EGOVISION: EVALUATION
V3: moving camera, moving person: the most challenging tracking scenario.
Challenges:
• changes in object shape (e.g. change in head pose)
• target exits the camera FoV
• occlusions between objects
• abrupt changes in lighting
• occasional low image resolution due to motion blur
Considerations:
• lack of loss detection results in tracking failure after very few frames
• adaptive models often cannot cope with the challenges of this scenario, and adapt to the background to some degree, resulting in the tracker quickly drifting
• adaptability to scale changes is needed, due to the person moving closer to objects of interest
TRACKING IN EGOVISION: EVALUATION
Tracking results in the third scenario:
V3.1: face tracking under person motion. Discriminative colors between object and background allow good performance for HBT. NCC performs well due to the lack of object changes.
V3.2: face tracking under both person and camera motion. Adaptive models end up adapting to the background. NCC does not adapt and hence does not drift.
V3.3: face tracking with an indoor-outdoor transition. The abrupt lighting change during the transition compromises most trackers. Adaptive models (TLD, STR, HBT) try to adapt to the transition, and then cannot adapt back to the object when the lighting stabilizes.
TRACKING IN EGOVISION: EVALUATION
Tracking results: the table shows the F-measure for each video and each tracker.

           Still camera,     Moving camera,            Moving camera,
           still person      still person              moving person
Video      V1.1     V1.2     V2.1     V2.2     V2.3    V3.1     V3.2     V3.3
NN         0.5204   0.2793   0.2314   0.0472   0.1211  0.2552   0.0867   0.1565
HBT        0.5187   0.1177   0.0206   0.1602   0.0333  0.5786   0.1457   0.0973
TLD        0.4838   0.1767   0.5091   0.6372   0.4342  0.2446   0.0237   0.1303
STR        0.6406   0.2397   0.0698   0.5745   0.0801  0.5532   0.0294   0.0879
NCC        0.4326   0.2251   0.4575   0.3769   0.0147  0.3607   0.1834   0.1118
FRT        0.2271   0.2138   0.1406   0.0294   0.0389  0.0984   0.1492   0.0756

A lot of work to do…
IN SIMPLE CASES…
Problem: tracking people to detect social groups [Alletto CVPRW 2014].
• Viola-Jones (VJ) for initial detection
• HBT and TLD for tracking faces in real time, plus re-identification
• classification for orientation
• correlation clustering for detecting social groups
* MIUR cluster project in smart cities, "educating city": recognizing children's social activity, 2014-2017.
SOCIAL FEATURES
Head detection from egovision and pose estimation: determine the head yaw angle.
• Pose classes at -90°, -60°, -30°, 0°, 30°, 60°, 90° (class interval boundaries: -75, -45, 0, 45, 75).
• HOG descriptor computed using 8x8 cells, 16 bins per cell; power normalization is then applied.
• Multiclass linear SVM and HMM are used to discriminate the different pose classes.
Distance estimation and 3D reconstruction:
• No camera calibration required; random regression forests. A grid model is applied to estimate the person's 3D location, accounting for projective deformation.
METHOD OVERVIEW
Video stream -> multiple face detection -> tracking and segmentation (frames t-1, t, t+1) -> HOG -> head pose estimation (SVM + HMM) and face area estimation -> distance estimation (random forest + 3D estimation) -> 3D bird's-eye-view model -> correlation clustering (SSVM) -> group composition estimation.
LAST: EXECUTION TIME
An example of execution-time comparison on the same video: HBT, STR, TLD and NN (nearest neighbour with histograms).
CONCLUSIONS AND OPEN PROBLEMS
• Single-target detection & tracking: two big problems in all video sources.
• In egovision the problem is open.
And then:
• recognizing actions & behaviors…
• understanding what people are seeing.
Maybe it's not utopia anymore.
HOMEWORK
• Work with the ALOV++ dataset, www.alov.org.
• Compare matching-based and discriminative-based tracking in different scenarios.
• Try to understand the motivations of failures.
• Try to understand which are the weak points for egovision.
• Answer the questions:
1) Why are HOG, HOF and MBH (as well as trajectory shapes) used for ego-gesture recognition with hand segmentation by superpixels?
2) What is a comprehensive definition of egocentric vision?
3) Single-target tracking approaches can be divided into two broad categories. Which ones?
4) Why, in ego-vision, are tracking algorithms based on a single or multiple memory of targets more suitable?
ADDITIONAL REFERENCES
• T.-K. Kim and R. Cipolla. Canonical correlation analysis of video volume tensors for action categorization and
detection. Trans. PAMI, 2009
• Y. M. Lui, J. R. Beveridge, and M. Kirby. Action classification on product manifolds. In Proc. of CVPR, 2010
• Y. M. Lui and J. R. Beveridge. Tangent bundle for human action recognition. In Proc. of Automatic Face & Gesture Recognition and Workshops, 2011
• A. Sanin, C. Sanderson, M. T. Harandi, and B. C. Lovell. Spatio-temporal covariance descriptors for action and gesture
recognition. In Proc. of Workshop on Applications of Computer Vision, 2013.
• L.Baraldi, F.Pace, G.Serra, L.Benini and R.Cucchiara Gesture recognition in ego-centric videos using dense trajectories
and hand segmentation EVW @CVPR2014
• R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk. SLIC superpixels. Technical report, EPFL, 2010.
• H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action Recognition by Dense Trajectories. In Proc. of CVPR, 2011 and
IJCV2013.
• G.Gualdi, A. Prati, R. Cucchiara, "Multi-Stage Particle Windows for Fast and Accurate Object Detection"in IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 34, n. 8, pp. 1589-1604, 2012
• Adam, E. Rivlin, and I. Shimshoni, “Robust fragments-based tracking using the integral histogram,” in CVPR, 2006.
• M. Godec, P. M. Roth, and H. Bischof, “Hough-based tracking of non-rigid objects,” in ICCV, 2011
• Z. Kalal, J. Matas, and K. Mikolajczyk, “Online learning of robust object detectors during unstable tracking,” CVPR
2009
• S. Hare, A. Saffari, and P. H. S. Torr, “Struck: Structured output tracking with kernels,” in ICCV, 2011.
THANKS TO
http://imagelab.ing.unimo.it
Imagelab people:
Rita Cucchiara, Giuseppe Serra, Marco Manfredi, Costantino Grana, Paolo Santinelli, Francesco Solera, Roberto Vezzani, Martino Lombardi, Simone Pistocchi, Simone Calderara, Michele Fornaciari, Fabio Battilani, Dalia Coppi, Patrizia Varini, Augusto Pieracci, Stefano Alletto