Lecture 09, Rita Cucchiara - Egocentric Vision: Tracking and Recognizing Human Signs
1. ICVSS
MARINA DI RAGUSA JULY 2014
Prof. Rita Cucchiara
DIPARTIMENTO DI INGEGNERIA Enzo Ferrari
Università di Modena e Reggio Emilia, Italia
Egocentric vision
tracking and recognizing human signs
From fundamentals to applications
http://www.Imagelab.ing.unimore.it
2. Rita Cucchiara ICVSS 2014, Italy
AGENDA
Egocentric vision: from applications to fundamentals (and vice versa)
• Introduction
• Challenges in ego-vision problems
• Recognizing ego-gestures by motion
• The (unsolved) tracking problem, in ego-vision too
• Discussion at Ragusa Ibla
1. INTRODUCTION
Can we know what you are looking at?
EGOCENTRIC VISION
Egocentric vision ("ego-vision"): models and techniques for understanding what a person sees, from the first-person point of view and centered on human perceptual needs.
Often called first-person vision, recalling the use of wearable cameras (e.g. mounted on glasses on the head) to acquire and process the same visual stimuli that humans acquire and process.
In a broader meaning: to understand what a person sees, wants to see, or would like to see (e.g. in the case of visual impairment), and to exploit learning, perception and reasoning paradigms similar to those of humans.
RESEARCH @IMAGELAB
from surveillance…
to new vision sensors
Floorimage
Drones
Smartphones Ego-vision
A SMALL, INCOMPLETE STORY…
1961: Edward O. Thorp (with Shannon) built a computerized timing device for cheating at the game of roulette (from [Thorp ISWC 98]).
1980s to now: Steve Mann (now at the University of Toronto) defined many concepts of wearable computing for vision.
"My current wearable prototype, equipped with head-mounted display, cameras, and wireless communications, enables computer-assisted forms of interaction in ordinary situations - for example, while walking, shopping, or meeting people - and it is hardly noticeable." [Mann Computer 1997]
1998 to now: the MIT wearable computing lab (Alex Pentland, Bernt Schiele, …)
A SMALL, INCOMPLETE STORY (cont.)
2000: MIThril, MIT.
2004: Richard DeVaul, PhD thesis on the Memory Glasses (supervisor A. Pentland) [DeVaul MIT 04].
2009: 1st CVPR workshop on Egocentric Vision, by Philipose, Hebert, Ren.
2009: T. Kanade, "First-person, inside-out vision," keynote.
2012: 2nd CVPR workshop on Egocentric Vision, by Rehg, Ramanan, Ren, Fathi, Pirsiavash. Google Glass.
2014: 3rd CVPR workshop on Egocentric Vision, by Kitani, Lee, Ryoo, Fathi; three papers at CVPR 2014; a session at IWCV 2014.
Here we are.
Google temporarily banned facial recognition technology on Google Glass due to privacy concerns.
APPLICATIONS (OR MARKET NEEDS?)
Life logging: recording your life digitally.
• Memex, 1945, Vannevar Bush
• BigBrother, 1997
• MyLifeBits, Gordon Bell, Microsoft, 2000
• SenseCam, Microsoft
• Sony Core
• Google Glass
Life caching, lifeblogging…
Human augmentation applications: creating cognitive and physical improvements as an integral part of the human body (Gartner IT Glossary, 2012).
LIFE LOGGING: MARKET
A megabit for every second in a year: roughly 10 million seconds per year, so about 10 Tbit is needed to store everything I look at for a year.
Memoto: two days of memory with two shots per minute; 4 GB of data per day amounts to up to 1.5 terabytes per year, which you can store on its cloud.
Autographer: a camera capable of shooting up to 2,000 shots a day.
GoPro (see @youtube).
The lifelogger camera driving design requirements:
1. compact and small enough to wear like a necklace
2. wide field of view, to capture what I see
3. battery life lasting a full day
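As a back-of-envelope check of the storage figures above (the constants are the slide's own; the exact seconds-per-year count is mine):

```python
# Lifelogging storage, back of the envelope.
seconds_per_year = 365 * 24 * 3600            # exactly 31,536,000; the slide rounds to ~10 million
bits_per_year = 1e6 * seconds_per_year        # at 1 Mbit for every second
print(f"{bits_per_year / 1e12:.0f} Tbit/year at 1 Mbit/s")

# Memoto-style device: 4 GB per day.
gb_per_year = 4 * 365
print(f"{gb_per_year / 1000:.2f} TB/year at 4 GB/day")   # close to the ~1.5 TB claim
```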
AUGMENTED VISION
Who is who?
Augmented vision for:
• security and surveillance
• smart manufacturing
• natural interfaces, HCI 2.0
• wellbeing
• education
• entertainment
• …
AUGMENTED VISION
Applications and market needs: MANY.
Applications for reading, for visually impaired and blind people:
• OrCam (Shashua 2010, Hebrew University spin-off)
Applications for smart manufacturing:
• Viewranger (Japan)
Applications for security and surveillance.
2. CV CHALLENGES FOR EGOVISION
Why is egocentric vision not so trivial?
CV CHALLENGES FOR EGOVISION (1/6)
a. Hardware
• Design new hardware [Badino, Kanade MVA 2011]
• Exploit real-time capabilities for egovision
At UNIMORE & ETHZ: Odroid XU, ARM Exynos 5 heterogeneous octa-core.
Devices: Google Glass, Vuzix M100, Kopin Golden-i, Olympus MEG4.0
CV CHALLENGES FOR EGOVISION (2/6)
b. Recognizing FoA (focus of attention) and PoI (points of interest)
• Eye tracking & ego-vision [Tsukada ETRA 2012]
• Estimating FoA [Li ICCV 2013], [Ogaki CVPRW 2012], [Fathi ECCV 2012]
A. Fathi, Y. Li, J. Rehg, Learning to Recognize Daily Actions Using Gaze, ECCV 2012
CV CHALLENGES FOR EGOVISION (3/6)
c. Recognizing head motion
• Head/body motion for outdoor summarization [Poleg CVPR 2014]
• Motion for indoor summarization [Grauman CVPR 2013]
• Motion for supporting attention [Matsuo CVPRW 2014]
• Motion for SLAM, as in robotics [Bahera ACCV 2012]
Cumulative displacement curves: Poleg, Arora, Peleg, Temporal Segmentation of Egocentric Videos, CVPR 2014
UNDERSTANDING MOTION
Dense OF vs. sparse OF: the classical dense OF approach vs. the classical Lucas-Kanade (LK) approach.

Image brightness constancy assumption:

    E_x (dx/dt) + E_y (dy/dt) + E_t = 0,   i.e.   (grad E)^T v + E_t = 0

For each pixel p_i in a small window (e.g. 5x5 = 25 pixels), LK stacks one such equation:

    E_x(p_i) u + E_y(p_i) v = -E_t(p_i),   i = 1, …, 25

giving an over-determined linear system A d = b, with A (25x2) built from the spatial gradients, d = (u, v)^T, and b the negated temporal derivatives. It is solved through the normal equations:

    (A^T A) d = A^T b   =>   d = (A^T A)^(-1) A^T b,

where A^T A = [[Sum E_x E_x, Sum E_x E_y], [Sum E_x E_y, Sum E_y E_y]]. The solution is reliable only if A^T A is well conditioned: the eigenvalue ratio lambda_1/lambda_2 should not be large (lambda_1 = larger eigenvalue).

Iterative Lucas-Kanade algorithm [1981].

Farneback OF [2003]: faster; assumes that a local neighborhood of each pixel in an image can be represented on a polynomial basis.
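As a minimal sketch (not the deck's code), the LK normal equations can be solved for a single window with NumPy; the synthetic image and window size below are my own choices:

```python
import numpy as np

def lucas_kanade_window(I0, I1):
    """Estimate one translation (u, v) for a window by solving the
    normal equations (A^T A) d = A^T b from the brightness constancy
    equation Ex*u + Ey*v + Et = 0."""
    Ex = np.gradient(I0, axis=1)                     # spatial gradients
    Ey = np.gradient(I0, axis=0)
    Et = I1 - I0                                     # temporal derivative
    A = np.stack([Ex.ravel(), Ey.ravel()], axis=1)   # (N, 2)
    b = -Et.ravel()
    ATA = A.T @ A
    # A^T A must be well conditioned (texture in both directions),
    # as the eigenvalue test on the slide requires.
    return np.linalg.solve(ATA, A.T @ b)             # d = (u, v)

# Synthetic check: a textured image whose content moves +1 px in x.
x = np.arange(32, dtype=float)
y = np.arange(32, dtype=float)
I0 = np.sin(0.3 * x)[None, :] + np.sin(0.3 * y)[:, None]
I1 = np.sin(0.3 * (x - 1.0))[None, :] + np.sin(0.3 * y)[:, None]
u, v = lucas_kanade_window(I0, I1)
print(u, v)   # close to (1, 0)
```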
COMPUTING MOTION
Optical flow.
Dense optical flow:
• Horn-Schunck [AI 1981]
• Farneback [SCIA 2003]
• Liu, Chellappa, Rosenfeld [ICPR 2002] (eigenvalues)
• Medioni [PAMI 2008] (tensor voting)
• …
Sparse optical flow:
• Lucas-Kanade [IJCAI 1981]
• other keypoints (SIFT, SURF, …)
Optical flow alone is not enough for understanding head motion.
CV CHALLENGES FOR EGOVISION (4/6)
d. Recognizing objects
• Objects useful for humans [Fathi CVPR 2013]
• Objects in the hand [Fathi, Rehg CVPR 2011]
• Target tagging in the scene [Pirsiavash, Ramanan CVPR 2012]
• Objects around you [Yvashita et al. ICRA 2014]
CV CHALLENGES FOR EGOVISION (5/6)
e. Recognizing actions
• Self-actions, gestures [Kitani CVPR 2013; Baraldi EVW 2014]
• Actions of people, social actions [Ryoo CVPR 2013], [Alletto EFPVW 2014], [Narayan CVPRW 2014]
• Actions in the environment (sport, …) [Kitani IEEE PC Magazine 2012]
EGO-GESTURE RECOGNITION
• Monocular hand-gesture recognition
• Deals with both static gestures (hand pose) and dynamic gestures (motion)
• Very few positive samples
• Many changes in luminance
HAND SEGMENTATION IN EGOVISION
Hand recognition is an old problem, with many approaches in different contexts:
• Skin classification [Khan et al. ICIP 2010]: random forests (better than BN, MLP, NB, AdaBoost, …)
• Background subtraction after image registration [Fathi ICCV 2011] (assuming static background, hands holding objects, etc.)
• Generic object recognition [Li, Kitani CVPR 2013]: sparse feature selection and a battery of RFs trained under different luminance conditions
AN EGOVISION SOLUTION
Pipeline: superpixel segmentation -> superpixel descriptors -> classification by a collection of RFs -> temporal coherence -> spatial coherence.
Superpixel segmentation:
• SLIC (Simple Linear Iterative Clustering [Achanta 2010])
• k-means in 5D (Lab + xy)
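The 5-D k-means idea can be sketched as plain k-means on (L, a, b, x, y) features; note that this omits SLIC's local 2S x 2S window search, which is what makes the real algorithm fast. The weighting factor m and toy image are my own:

```python
import numpy as np

def slic_like(image_lab, n_clusters=2, m=0.5, iters=5):
    """k-means in 5D (L, a, b, m*x, m*y): the core of SLIC, without
    the localized search of [Achanta 2010]. m trades off spatial vs.
    color distance."""
    H, W, _ = image_lab.shape
    ys, xs = np.mgrid[0:H, 0:W]
    feats = np.concatenate(
        [image_lab.reshape(-1, 3).astype(float),
         m * np.stack([xs.ravel(), ys.ravel()], axis=1)], axis=1)
    centers = feats[[0, -1]].copy()          # deterministic init: two corners
    centers = np.vstack([centers] + [feats[[i]] for i in []])[:n_clusters]
    for _ in range(iters):
        d = ((feats[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(0)
    return labels.reshape(H, W)

img = np.zeros((16, 16, 3))
img[:, 8:] = 50.0                            # two flat color regions
labels = slic_like(img)
```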
AN EGOVISION SOLUTION
Superpixel descriptors:
• mean and covariance in RGB
• Lab histogram and HSV histogram
• 27 Gabor filters (9 orientations, 3 scales: 7x7, 13x13, 19x19)
• HOG
AN EGOVISION SOLUTION
Classification:
• a collection of random forests, indexed by a 32-bin RGB histogram of the frame (a global luminance feature)
• the index encodes the appearance of the scene and the global luminance
• hypothesis: background and hands undergo similar color changes
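The "collection of classifiers indexed by global luminance" idea can be sketched as one model per luminance bin, selected at test time from the frame's global luminance. A nearest-centroid model stands in here for the random forests of the actual system; all numbers are illustrative:

```python
import numpy as np

class LuminanceIndexedClassifiers:
    """One classifier per global-luminance bin; the bin of the current
    frame selects which model to apply (sketch of the slide's idea)."""
    def __init__(self, n_bins=4):
        self.n_bins = n_bins
        self.models = {}                      # bin index -> {class: centroid}

    def _bin(self, luminance):                # luminance in [0, 1]
        return min(int(luminance * self.n_bins), self.n_bins - 1)

    def fit_frame(self, luminance, X, y):
        b = self._bin(luminance)
        self.models[b] = {c: X[y == c].mean(axis=0) for c in np.unique(y)}

    def predict(self, luminance, X):
        cents = self.models[self._bin(luminance)]
        classes = list(cents)
        d = np.stack([np.linalg.norm(X - cents[c], axis=1) for c in classes])
        return np.array([classes[i] for i in d.argmin(axis=0)])

clf = LuminanceIndexedClassifiers(n_bins=2)
# Hand (class 1) vs. background (class 0) colors shift with luminance.
clf.fit_frame(0.2, np.array([[0.4, 0.2, 0.2], [0.1, 0.1, 0.1]]), np.array([1, 0]))
clf.fit_frame(0.8, np.array([[0.9, 0.6, 0.5], [0.7, 0.7, 0.7]]), np.array([1, 0]))
```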
AN EGOVISION SOLUTION
Temporal coherence: temporal smoothing in a window of k frames; the posterior probability of being (or not being) a hand pixel in the previous window is used as an estimated prior.
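A minimal sketch of windowed temporal smoothing of per-pixel posteriors (the window size and toy maps are mine; the real system uses the smoothed posterior as a prior for the classifier):

```python
import numpy as np
from collections import deque

def smooth_posteriors(frames_posterior, k=3):
    """Average each pixel's hand posterior over a sliding window of
    the last k frames, damping per-frame classifier flicker."""
    window = deque(maxlen=k)
    out = []
    for p in frames_posterior:
        window.append(p)
        out.append(np.mean(list(window), axis=0))
    return out

# A pixel whose raw posterior flickers 1, 0, 1 is damped to 2/3.
noisy = [np.array([[1.0]]), np.array([[0.0]]), np.array([[1.0]])]
smoothed = smooth_posteriors(noisy, k=3)
```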
AN EGOVISION SOLUTION
Spatial coherence (spatial consistency):
• eliminate spurious superpixels
• close holes
• run GrabCut using the posterior probability map as seed
HAND SEGMENTATION
On standard datasets, there is a significant improvement in performance when all three consistency aspects are used together:
• illumination invariance (II)
• temporal smoothing (TS)
• spatial consistency (SC)
CAMERA MOTION REMOVAL
• Extract dense keypoints from consecutive frames and match them.
• Estimate the homography between the frames.
• Apply it to the original frame sequence, obtaining an output sequence without camera motion.
Notes:
• In ego-camera views, hand movement is usually not consistent with camera motion, resulting in wrong matches between the two frames.
• A segmentation mask disregards feature matches belonging to hands.
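The homography-estimation step can be sketched with the Direct Linear Transform (DLT) on point correspondences; this omits RANSAC and the hand-mask filtering described above, and the synthetic points are my own:

```python
import numpy as np

def estimate_homography(src, dst):
    """DLT: estimate the 3x3 homography H mapping src -> dst from at
    least 4 correspondences, via the SVD null space of the stacked
    linear constraints."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

# A pure translation by (5, -3) should yield [[1,0,5],[0,1,-3],[0,0,1]].
src = [(0, 0), (10, 0), (0, 10), (10, 10)]
dst = [(x + 5, y - 3) for x, y in src]
H = estimate_homography(src, dst)
```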
EXPERIMENTAL RESULTS
Datasets:
• The Cambridge-Gesture database, with 900 sequences of nine hand-gesture types under different illumination conditions;
• our Interactive Museum dataset, an egocentric gesture-recognition dataset with 700 sequences of seven gesture classes performed by five subjects;
• the EDSH dataset, which consists of three egocentric videos with indoor and outdoor scenes and large variations of illumination.
See results in [Baraldi CVPRW 2014].
CV CHALLENGES FOR EGOVISION (6/6)
f. Tracking: recognizing across time
• tracking target objects
• tracking faces and people [Alletto ICPR 2014]
• multiple-target tracking
TRACKING
A 2x2 taxonomy (single vs. multiple targets, single vs. multiple fields of view):
• Single object from a single camera: static camera, moving camera; smartphone, egovision, …
• Multiple objects from a single camera
• Single object from multiple cameras: overlapped or disjoint FoVs; networks of egovision systems; heterogeneous camera networks
• Multiple objects from distributed cameras
See imagelab.ing.unimore.it
TRACKING IN EGOVISION
Straightforward relationships with robot vision!
Similar to video surveillance:
• fast, real-time
• similar scenes (typically people, social life, children, …)
• many similar tasks (detection, action recognition, tracking)
• as in people tracking, motion is unpredictable
but:
• unconstrained
• largely different motion factors
• frequent changes of field of view
• interactive
HOW DO WE SEE?
The path (E. Kandel, 2008):
1) The stimuli from the retinae, through two parallel paths, reach the lateral geniculate nucleus in the thalamus, then the cortex in the occipital lobe, and then the temporal and frontal lobes.
2) Two parallel paths:
   - the WHAT pathway, in the temporal lobe, perceives color, the shape of the object, the face, …
   - the WHERE pathway, in the parietal lobe, provides localization of such objects.
3) The centers are hierarchically connected; they process information and then come back to the WHAT area and work together, based on attention and purpose.
SINGLE TARGET TRACKING
Tracking is the task of generating an inference about the motion of an object given a sequence of images.*
Here we analyze the behavior of current tracking solutions in following a given target frame-by-frame.
* Arnold Smeulders, Dung M. Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and Mubarak Shah, Visual Tracking: an Experimental Survey, IEEE TPAMI, 2014.
SINGLE TARGET TRACKING
The (unsolved) questions in tracking a given target:
1) Region of interest: the data, or the observation
2) Representation: the state (the object model):
   2.1) how to observe invariant and variant features in the frame, and
   2.2) how to hold them in an internal representation
3) Inference method: the inference
4) Model update
THE HARDNESS OF TRACKING
What is the invariance that can be perceived and maintained over time?
Tracking is hard because nothing is fixed:
• problems of light: the target aspect, the illumination;
• problems of motion: the object/camera motion;
• problems of scene: occlusion, confusion, …
… searching for the invariance in the video.
Despite the variety in video, most papers use only 6-10 long videos. This covers variety poorly.
WHY IS TRACKING SO HARD?
14 challenges in video tracking:
• Light: 1. light; 10. scene contrast
• Object aspect: 2. object surface cover; 3. object specularity; 4. object transparency; 5. object shape
• Object motion: 6. motion smoothness; 7. motion coherence
• Scene context: 8. scene clutter; 9. scene confusion; 11. scene occlusion
• Camera motion: 12. camera moving; 13. camera zoom
• Temporal coherence: 14. long videos
TRACKING IN EGO-VISION
In egovision?
• all the previous problems!
• relative motion:
  - no motion of the observer, but a moving target
  - motion of the observer, but a fixed target
  - motion of both observer and target
• datasets at imagelab.ing.unimore.it: EGO_GROUP, EGO_TRACK
GENERAL-PURPOSE TRACKING
Single-target tracking, without any constraints.
From Kalman, extended Kalman, KLT, particle filter, mean shift, … to new generations of tracking solutions that:
• explore multiple features and cues (SIFT, HOG, SURF, etc.)
• explore multiple object representations (fragments, graphs)
• explore new optimization methods
• explore new machine-learning solutions
But what about the results?
THE STATE OF THE ART
19 tracking solutions for you, in two families: tracking by matching, and tracking with a discriminative classifier.
THE STATE OF THE ART
Tracking by matching:
• [NCC] Normalized Cross-Correlation (K. Briechle and U. Hanebeck, SPIE 2001)
• [KLT] Lucas-Kanade Tracker (S. Baker and I. Matthews, IJCV 2004)
• [KAT] Kalman Appearance Tracker (H. Nguyen and A. Smeulders, TPAMI 2004)
• [FRT] Fragments-based Robust Tracking (A. Adam, E. Rivlin, and I. Shimshoni, CVPR 2006)
• [MST] Mean Shift Tracking (D. Comaniciu, V. Ramesh, and P. Meer, CVPR 2000)
• [LOT] Locally Orderless Tracking (S. Oron, A. Bar-Hillel, D. Levi, S. Avidan, CVPR 2012)
THE STATE OF THE ART
Tracking by matching (cont.):
• [IVT] Incremental Visual Tracking (D. Ross, J. Lim, and R.-S. Lin, IJCV 2008)
• [TAG] Tracking on the Affine Group (J. Kwon and F.C. Park, CVPR 2009)
• [TST] Tracking by Sampling Trackers (J. Kwon, K.M. Lee, ICCV 2011)
• [TMC] Tracking by Monte Carlo sampling (J. Kwon, K.M. Lee, CVPR 2009)
• [ACT] Adaptive Coupled-layer Tracking (L. Cehovin, M. Kristan, A. Leonardis, ICCV 2011)
• [L1T] L1-minimization tracker (X. Mei, H. Ling, ICCV 2009)
• [L1O] L1 minimization with occlusion (X. Mei, H. Ling, Y. Wu, E. Blasch, L. Bai, CVPR 2011)
THE STATE OF THE ART
Tracking with a discriminative classifier:
• [FBT] Foreground-Background Tracking (Nguyen, Smeulders, IJCV 2006)
• [HBT] Hough-Based Tracking (Godec, Roth, Bischof, ICCV 2011)
• [SPT] Superpixel Tracking (Wang, Lu, Yang, Yang, ICCV 2011)
• [MIT] Multiple Instance Learning (Babenko, Yang, Belongie, CVPR 2009)
• [TLD] Tracking-Learning-Detection (Kalal, Matas, Mikolajczyk, CVPR 2010)
• [STR] Structured Output Tracking (Hare, Saffari, Torr, ICCV 2011)
ONE PROBLEM, MANY SOLUTIONS…
The ingredients of a tracker: the RoI and the visual features (the data), the appearance and motion models, predictions and inference, and the model update.
REGION OF INTEREST
1. From manual or automatic detectors
2. From moving-object segmentation (background suppression, OF segmentation)
3. From local feature identification
APPEARANCE
b) Histograms
• color histograms (MST, TMC, HBT, SPT)
• intensity histograms (FRT, ACT)
Useful only in small patches; otherwise the spatial-relationship information has to be captured elsewhere in the tracking algorithm.
c) Feature vectors
Useful if the shape of the object is important (and constant, at least in some parts):
• Haar gradients (MIT)
• 2-bit binary patterns (TLD)
• SURF (FBT)
• Lab-color features and others (HBT)
Be careful to select true and stable invariants.
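As a small sketch of the histogram appearance models used by these trackers, a normalized per-channel color histogram can be compared with the Bhattacharyya coefficient (the mean-shift family's similarity); bin count and toy patches are my own:

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Concatenated per-channel intensity histograms, normalized to
    sum to 1, as a simple appearance model."""
    hists = [np.histogram(patch[..., c], bins=bins, range=(0, 1))[0]
             for c in range(patch.shape[-1])]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def bhattacharyya(h1, h2):
    """Similarity in [0, 1]; 1 means identical histograms."""
    return float(np.sqrt(h1 * h2).sum())

rng = np.random.default_rng(0)
target = rng.uniform(0.6, 0.9, size=(16, 16, 3))      # bright-ish patch
same = rng.uniform(0.6, 0.9, size=(16, 16, 3))        # similar appearance
different = rng.uniform(0.0, 0.3, size=(16, 16, 3))   # dark patch
h_t, h_s, h_d = map(color_histogram, (target, same, different))
```

A matching tracker would pick, among candidate windows, the one whose histogram maximizes this similarity to the target model.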
APPEARANCE
Some trackers keep the appearance information of the scene, i.e. add contextual information:
• background intensity representation (the methods based on background subtraction keep a reference background), as in surveillance ([Wren et al. TPAMI 97], [MoG CVPR 99], [Sakbot TPAMI 03]);
• occlusion-detection information [Ad-Hoc PRL 11, ALIEN 13];
• confusion information [Medioni CVPR 11];
• camera-motion information [Qiogui IASP 11].
They cannot always be used in general, unknown visual contexts.
MOTION MODEL
1. Uniform search (no motion model), e.g. STR, FBT
2. Probabilistic Gaussian motion model, e.g. IVT, L1T
3. Motion prediction, e.g. Kalman, ACT (a linear motion model, sometimes guided by OF)
4. Implicit motion model (with optimization), e.g. KLT, MST
5. Multiple models: tracking and detection (TLD), particle filters (SPT), 3D affine or projective motion models (TAG)
MOTION
Some considerations:
• uniform search is simple, but better if guided at least by optical flow, especially in egovision;
• implicit motion with optimization works only if the motion is small and the appearance fairly constant, which is not always possible;
• motion prediction is excellent in specific applications: for instance, in intelligent transportation systems a linear motion prediction is hardly questionable.
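The linear motion prediction mentioned above can be sketched as a constant-velocity Kalman filter in 1-D; all noise values and measurements below are toy numbers of my own:

```python
import numpy as np

# Constant-velocity Kalman filter for a 1-D position track.
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition (position, velocity)
Hm = np.array([[1.0, 0.0]])             # we only observe position
Q = 1e-4 * np.eye(2)                    # process noise
R = np.array([[1e-2]])                  # measurement noise

x = np.array([[0.0], [0.0]])            # initial state
P = np.eye(2)                           # initial uncertainty

for z in [1.0, 2.0, 3.0, 4.0]:          # target moving at ~1 unit/frame
    # Predict.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the measurement z.
    K = P @ Hm.T @ np.linalg.inv(Hm @ P @ Hm.T + R)
    x = x + K @ (np.array([[z]]) - Hm @ x)
    P = (np.eye(2) - K @ Hm) @ P

predicted_next = (F @ x)[0, 0]          # where the tracker searches next (~5)
```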
MODEL UPDATE
Model updating strategies:
• No update (NCC, FRT, MST, KLT): no update, but search for the best transform of the model; it may seem trivial, but it is good for short sequences.
• Last-seen template, or partial updating (Porikli CVPR 06).
• Predicting the new appearance (KAT): handles long-term occlusion, but abrupt changes can produce errors.
• Patch updates (TMC and ACT): add or delete patches.
• Updating an extended model, e.g. incremental PCA (IVT, TAG, TST).
INFERENCE METHODS
The method: the computational paradigm used to find the best location and/or the best state of the target in the new frame. Let T be the target model (the state) and C the candidate objects (the data).
a) Matching
b) Matching with an extended target
c) Matching with constraints
d) Discriminative
e) Discriminative with constraints
INFERENCE METHODS
a) Matching:
• direct gradient ascent (NCC, KLT), or probabilistic matching (MST);
• matching with particle filtering (IVT, TST, TAG).
Useful if the appearances of target and background differ, to avoid local maxima; useful in case of good appearance invariance (problems with intensity or surface-luminance changes); good for occlusions and low contrast.
b) Matching with extended appearance:
• subspace matching in extended models, with stored examples (IVT), or with different track results (TAG, TST).
Similar to having a long-term memory; useful for long-term tracking or with occlusions; more complex.
c) Matching with constraints:
• adding rules for the context (TAG), for the positions of patches (L1T, L1O), or for their pose (ACT).
INFERENCE METHODS
d) Discriminative: a supervised discriminative classifier:
• FBT: linear discriminant analysis;
• HBT: segmentation and random forests;
• MIT and SPT: clustering;
• TLD: a pool of randomized classifiers.
Often very few examples are available (thus LDA can be better than multiple-instance learning); in case of errors there are problems of drifting away.
e) Discriminative with constraints:
• structured classifier: STR uses as output the displacement of the target instead of a per-pixel label.
[NCC] NORMALIZED CROSS-CORRELATION
Direct target matching by normalized cross-correlation [Briechle et al. SPIE 2001].
• Intensity values in the initial target box are the template t;
• matching by sampling uniformly around the previous position;
• take the highest NCC score at pixel level;
• no updating of the target.

    N(m, n) = Sum_{i,j} g(i+m, j+n) t(i, j) / sqrt( Sum_{i,j} g(i+m, j+n)^2 * Sum_{i,j} t(i, j)^2 )
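A minimal sketch of NCC tracking following this formula (exhaustive search in a window; the synthetic frame and search radius are my own choices):

```python
import numpy as np

def ncc(patch, template):
    """Normalized cross-correlation score, as in the N(m, n) formula."""
    num = (patch * template).sum()
    den = np.sqrt((patch ** 2).sum() * (template ** 2).sum())
    return num / den

def ncc_track(frame, template, center, radius):
    """Exhaustive search around the previous position; return the
    top-left corner with the highest NCC score."""
    th, tw = template.shape
    best, best_pos = -np.inf, center
    cy, cx = center
    for y in range(max(0, cy - radius), min(frame.shape[0] - th, cy + radius) + 1):
        for x in range(max(0, cx - radius), min(frame.shape[1] - tw, cx + radius) + 1):
            s = ncc(frame[y:y + th, x:x + tw], template)
            if s > best:
                best, best_pos = s, (y, x)
    return best_pos

rng = np.random.default_rng(1)
frame = rng.uniform(size=(40, 40))
template = frame[10:18, 20:28].copy()        # the target box
found = ncc_track(frame, template, center=(12, 18), radius=6)
```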
[FRT] FRAGMENTS-BASED ROBUST TRACKING
Matching an ensemble of 10 x 2 patches [Adam et al. CVPR 2006].
• The RoI is divided into patches;
• a new window is searched around the previous one, including a 10% scale change;
• each patch's intensity histogram is compared by the Earth Mover's Distance;
• the target is not updated.
Robust to pose changes and occlusions, and very simple; suitable for shape modification.
[HBT] HOUGH-BASED TRACKING
Discriminative classifier on Lab color, gradients and positions [Godec, Roth, Bischof, ICCV 2011]. The Hough forest provides a probability map of the target.
• Given the sample, represented by features;
• transform the target with a Hough forest;
• back-project from the Hough forest (as in the GHT's R-table);
• segment the target using GrabCut, and hence generate new samples.
[TLD] TRACKING, LEARNING AND DETECTION
Top detections on binary patterns and KLT optical flow are combined by NCC; fast [Kalal, Matas, Mikolajczyk, CVPR 2010].
• Samples are selected in, around and away from the target to update the classifier (labeled and unlabeled).
• If neither of the two trackers produces an output, TLD declares a loss and recovers.
• It learns which is the best detector; good for short-term occlusion.
[STR] STRUCTURED OUTPUT TRACKING
Structured supervised classifier on {appearance, translation} [Hare, Saffari, Torr, ICCV 2011].
• The window is described by Haar features at 2 scales.
• Sampling is uniform around the previous position.
• The S-SVM learner updates the constraint to stay at the current location; the locations which violate the support points are used for the new SVM.
IS TRACKING GOOD ENOUGH?
Measuring results is hard
TRACKING MEASURES
Let GT_i be the ground truth in frame i and T_i the detected target in frame i.

Match degree at pixel level:

    MD = |T_i ∩ GT_i| / |T_i ∪ GT_i|

A frame is a match if MD >= Th, with Th = 0.5 (the PASCAL measure). Without the threshold, the overlap itself is used (called the DICE measure here).
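The overlap measure is straightforward to compute for axis-aligned boxes; a small sketch (box format (x1, y1, x2, y2) is my own convention):

```python
def overlap(box_t, box_gt):
    """|T ∩ GT| / |T ∪ GT| for axis-aligned boxes (x1, y1, x2, y2);
    thresholding at 0.5 gives the PASCAL criterion."""
    ix1, iy1 = max(box_t[0], box_gt[0]), max(box_t[1], box_gt[1])
    ix2, iy2 = min(box_t[2], box_gt[2]), min(box_t[3], box_gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_t) + area(box_gt) - inter
    return inter / union

md = overlap((0, 0, 10, 10), (5, 0, 15, 10))   # half-shifted boxes -> 1/3
```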
TRACKING MEASURES
At object level, over a sequence of frames (i = 1, …, N_frame):

    n_tp = Sum_i n_tp^i,   n_fp = Sum_i n_fp^i,   n_fn = Sum_i n_fn^i

    Precision = n_tp / (n_tp + n_fp),   Recall = n_tp / (n_tp + n_fn)

    F = 2 * Precision * Recall / (Precision + Recall)   (also called correct track ratio)

At area/pixel level, per frame:

    p_i = |T_i ∩ GT_i| / |T_i|,   r_i = |T_i ∩ GT_i| / |GT_i|

    F1 = (1 / N_frame) * Sum_i 2 * p_i * r_i / (p_i + r_i)
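The object-level measure can be sketched directly from per-frame counts (the three-frame example is my own):

```python
def track_scores(per_frame_counts):
    """Sequence-level precision, recall and F-score from per-frame
    (tp, fp, fn) counts."""
    ntp = sum(c[0] for c in per_frame_counts)
    nfp = sum(c[1] for c in per_frame_counts)
    nfn = sum(c[2] for c in per_frame_counts)
    precision = ntp / (ntp + nfp)
    recall = ntp / (ntp + nfn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# 3 frames: target found, found with an extra false alarm, missed.
p, r, f = track_scores([(1, 0, 0), (1, 1, 0), (0, 0, 1)])   # all 2/3
```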
TRACKING MEASURES
For measuring the position deviation instead:

    Deviation = 1 - ( Sum_{i in M} d(C_{T_i}, C_{GT_i}) ) / |M|

where d(x, y) is the L2 distance between the centroids and M is the set of matched frames.

PBM (Position-Based Matching):

    PBM = (1 / N_frames) * Sum_i ( 1 - d1(T_i, GT_i) / sp_ave(i) )

where d1 is the L1 norm and sp_ave(i) = (H(T) + W(T) + H(GT) + W(GT)) / 2 is the average semi-perimeter of T and GT.
EXPERIMENTAL RESULTS ON ALOV++
Survival curves (Kaplan-Meier) for [NCC], [STR], [L1O], [TST], [TLD], [FBT].
• The upper bound, taking the best of all trackers at each frame: 10%.
• The lower bound, what all trackers can do: 7%.
• Only about 30% is correctly tracked.
COMPARISON ON MATCHING MODELS
For egovision: comparison on videos with motion issues (FRT, KAT, KLT, LOT, MST, NCC).
• FRT, using fragments, copes well with motion changes.
• NCC is insensitive to motion changes.
CONFUSION CHALLENGE
[FBT] [NCC] [STR] [TLD] [TST] [L1O]
Confusion, crowd: short-term tracking.
OCCLUSION
Conclusions:
1. STR, FBT, TST, TLD and L1T are best here (!).
2. Light occlusion is approximately solved.
3. Full occlusion is still hard for most.
LONG TERM CHALLENGE:
[FBT][NCC][STR] [TLD][TST] [L1O]
IS TRACKING GOOD IN EGOVISION?
Problems in egovision:
• moving head and moving target
• changes of FoV
• changes of luminance
Dataset:
V1. (semi-)still observer
V2. moving head, not coherent motion (abrupt changes in motion patterns)
V3. camera-observer movement, with and without abrupt camera motions
Trackers:
• matching-based: NCC, NN (with color histograms), FRT
• discriminative-classifier-based: HBT, TLD, STR
TRACKING IN EGOVISION: EVALUATION
V1: (semi-)still camera
Challenges:
• changes in object shape (e.g. change in head pose)
• occlusions between objects
Pros:
• no camera motion: low blur; target losses can only be due to occlusions
• adaptive models can adapt to changes in object shape
Cons:
• occlusions are likely to occur, so loss detection is needed
• adaptive models must detect the loss, or they adapt to the occluding object
TRACKING IN EGOVISION: EVALUATION
Results in the first scenario (DICE measure: the overlap degree between the ground truth and the predicted bounding box).
V1.1: video without occlusions; the only challenges are the subject's pose changes.
V1.2: recurring occlusions; adaptive models (STR, TLD, HBT) fail to discriminate between the original target and the occluding one, adapt to it, and thus fail.
TRACKING IN EGOVISION: EVALUATION
V2: moving camera, still person
Challenges:
• changes in object shape (e.g. change in head pose)
• target exits the camera FoV
• occlusions between objects
Pros:
• the person stands still; abrupt lighting changes are not likely
Cons:
• occlusions are likely to occur, so loss detection is needed
• the target can exit the camera FoV, so loss detection and re-identification are needed
• adaptive models without loss detection quickly adapt to the background after a loss
TRACKING IN EGOVISION: EVALUATION
Tracking results in the second scenario:
V2.1: video with people chatting. HBT performs poorly due to its lack of loss detection and recovery. STR cannot detect the loss either, and adapts its support vectors to the background.
V2.2: tracking of an environmental point of interest. The target stays still but gets occluded and exits the camera FoV. Color-based trackers (HBT, NN) perform poorly due to the difficulty of discriminating the object based on color.
V2.3: tracking a face under fast-occurring occlusions. Responsive loss detection (TLD) is needed in order to stop adapting the model in time. Scale changes compromise model matching (FRT, NCC).
TRACKING IN EGOVISION: EVALUATION
V3: moving camera, moving person: the most challenging tracking scenario.
Challenges:
• changes in object shape (e.g. change in head pose)
• target exits the camera FoV
• occlusions between objects
• abrupt changes in lighting
• occasional low image resolution due to motion blur
Considerations:
• lack of loss detection results in tracking failure after very few frames
• adaptive models often cannot cope with the challenges of this scenario, and adapt to the background to some degree, resulting in the tracker quickly drifting
• adaptability to scale changes is needed, due to the person moving closer to objects of interest
TRACKING IN EGOVISION: EVALUATION
Tracking results in the third scenario:
V3.1: face tracking under person motion. Discriminative colors between object and background allow good performance for HBT. NCC performs well due to the lack of object changes.
V3.2: face tracking under both person and camera motion. Adaptive models end up adapting to the background. NCC does not adapt and hence does not drift.
V3.3: face tracking with an indoor-outdoor transition. The abrupt lighting change during the transition compromises most trackers. Adaptive models (TLD, STR, HBT) try to adapt to the transition, and then cannot adapt back to the object when the lighting stabilizes.
TRACKING IN EGOVISION: EVALUATION
Tracking results: the table shows the F-measure for each video and each tracker.

           Still camera,     Moving camera,            Moving camera,
           still person      still person              moving person
Video      V1.1     V1.2     V2.1     V2.2     V2.3    V3.1     V3.2     V3.3
NN         0.5204   0.2793   0.2314   0.0472   0.1211  0.2552   0.0867   0.1565
HBT        0.5187   0.1177   0.0206   0.1602   0.0333  0.5786   0.1457   0.0973
TLD        0.4838   0.1767   0.5091   0.6372   0.4342  0.2446   0.0237   0.1303
STR        0.6406   0.2397   0.0698   0.5745   0.0801  0.5532   0.0294   0.0879
NCC        0.4326   0.2251   0.4575   0.3769   0.0147  0.3607   0.1834   0.1118
FRT        0.2271   0.2138   0.1406   0.0294   0.0389  0.0984   0.1492   0.0756

A lot of work to do…
IN SIMPLE CASES…
Problem: tracking people to detect social groups [Alletto CVPRW 2014].
• Viola-Jones (VJ) for initial detection
• HBT and TLD for tracking faces in real time, plus re-identification
• classification for orientation
• correlation clustering for detecting social groups
* MIUR cluster project in smart cities, "educating city": recognizing children's social activity, 2014-2017.
SOCIAL FEATURES
Head detection from egovision and pose estimation: determine the head yaw angle.
• Pose classes at -90°, -60°, -30°, 0°, 30°, 60°, 90° (class interval boundaries: -75, -45, 0, 45, 75).
• HOG descriptor computed using 8x8 cells, 16 bins per cell; power normalization is then applied.
• Multiclass linear SVM and HMM are used to discriminate the different pose classes.
Distance estimation and 3D reconstruction:
• No camera calibration required; random regression forests. A grid model is applied to estimate the person's 3D location, accounting for projective deformation.
METHOD OVERVIEW
Video stream -> multiple face detection -> tracking and segmentation (frames t-1, t, t+1) -> HOG -> head pose estimation (SVM + HMM) and face area estimation -> distance estimation (random forest + 3D estimation) -> 3D bird's-eye-view model -> correlation clustering (SSVM) -> group composition estimation.
LAST: EXECUTION TIME
An example of execution-time comparison on the same video: HBT, STR, TLD and NN (nearest neighbour with histograms).
CONCLUSIONS AND OPEN PROBLEMS
• Single-target detection & tracking: two big problems in all video sources.
• In egovision the problem is open.
And then:
• recognizing actions & behaviors…
• understanding what people are seeing.
Maybe it's not utopia anymore.
HOMEWORK
• Work with the ALOV++ dataset, www.alov.org.
• Compare matching-based and discriminative-based tracking in different scenarios.
• Try to understand the motivations of failures.
• Try to understand which are the weak points for egovision.
• Answer the questions:
1) Why are HOG, HOF and MBH (as well as trajectory shapes) used for ego-gesture recognition with hand segmentation by superpixels?
2) What is a comprehensive definition of egocentric vision?
3) Single-target tracking approaches can be divided into two broad categories. Which ones?
4) Why, in ego-vision, are tracking algorithms based on a single or multiple memory of targets more suitable?
ADDITIONAL REFERENCES
• T.-K. Kim and R. Cipolla. Canonical correlation analysis of video volume tensors for action categorization and
detection. Trans. PAMI, 2009
• Y. M. Lui, J. R. Beveridge, and M. Kirby. Action classification on product manifolds. In Proc. of CVPR, 2010
• Y. M. Lui and J. R. Beveridge. Tangent bundle for human action recognition. In Proc. of Automatic Face & Gesture Recognition and Workshops, 2011
• A. Sanin, C. Sanderson, M. T. Harandi, and B. C. Lovell. Spatio-temporal covariance descriptors for action and gesture
recognition. In Proc. of Workshop on Applications of Computer Vision, 2013.
• L.Baraldi, F.Pace, G.Serra, L.Benini and R.Cucchiara Gesture recognition in ego-centric videos using dense trajectories
and hand segmentation EVW @CVPR2014
• R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk. SLIC superpixels. Technical report, EPFL, 2010.
• H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Action Recognition by Dense Trajectories. In Proc. of CVPR, 2011 and
IJCV2013.
• G.Gualdi, A. Prati, R. Cucchiara, "Multi-Stage Particle Windows for Fast and Accurate Object Detection"in IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 34, n. 8, pp. 1589-1604, 2012
• Adam, E. Rivlin, and I. Shimshoni, “Robust fragments-based tracking using the integral histogram,” in CVPR, 2006.
• M. Godec, P. M. Roth, and H. Bischof, “Hough-based tracking of non-rigid objects,” in ICCV, 2011
• Z. Kalal, J. Matas, and K. Mikolajczyk, “Online learning of robust object detectors during unstable tracking,” CVPR
2009
• S. Hare, A. Saffari, and P. H. S. Torr, “Struck: Structured output tracking with kernels,” in ICCV, 2011.
THANKS TO
http://imagelab.ing.unimo.it
Imagelab people:
Rita Cucchiara, Giuseppe Serra, Marco Manfredi, Costantino Grana, Paolo Santinelli, Francesco Solera, Roberto Vezzani, Martino Lombardi, Simone Pistocchi, Simone Calderara, Michele Fornaciari, Fabio Battilani, Dalia Coppi, Patrizia Varini, Augusto Pieracci, Stefano Alletto