This document summarizes an action recognition system that aims to detect conflict behaviors using CCTV footage. It describes detecting behaviors like pushing, punching, kicking, and falling. The system requires visible hands, touches between people, and distinguishable fast motions. It discusses challenges like false depth perception when people are close to the camera, occlusions, lack of fast movements during conflicts, and false detections like hugs. It reviews existing solutions and proposes a two-step approach using pose estimation and deep learning on frame sequences. It discusses challenges like dataset quality and representativeness and uncertainties around modeling occlusions and spatial orientations.
2. CCTV: Detect Conflict Behaviour
Detect:
1. Person pushing another person.
A push, punch or kick is a hand (or foot) movement at a speed above a configurable (not fixed) threshold value that ends with a touch on another person.
2. Person fighting another person by kicking or punching.
Requirements:
#1 The hitting hand, the touch and the participants are clearly visible.
#4 The projection of the strike motion onto the camera image is distinguishable as a fast motion.
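The definition above (a hand moving faster than a configurable threshold and ending in a touch) can be sketched roughly as follows. This is only an illustration under stated assumptions: the wrist track is assumed to come from some pose estimator as one (x, y) pixel position per frame, the other person is approximated by a bounding box, and the names `hand_speed`, `touches`, `is_strike`, `speed_threshold` and `margin` are hypothetical.

```python
import math

def hand_speed(track, fps):
    """Return wrist speeds in pixels per second between consecutive frames.

    `track` is a list of (x, y) wrist positions, one per frame (assumed input).
    Multiplying by `fps` keeps the threshold independent of the frame rate.
    """
    return [math.dist(a, b) * fps for a, b in zip(track, track[1:])]

def touches(point, box, margin=0.0):
    """True if `point` lies inside `box` = (x0, y0, x1, y1), grown by `margin`."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 - margin <= x <= x1 + margin and y0 - margin <= y <= y1 + margin

def is_strike(track, other_box, fps, speed_threshold, margin=0.0):
    """True if the hand exceeded the speed threshold and ends in contact."""
    speeds = hand_speed(track, fps)
    if not speeds:
        return False
    return max(speeds) > speed_threshold and touches(track[-1], other_box, margin)

# Example: a fast hand that ends inside the other person's box.
fast_track = [(0, 0), (10, 0), (40, 0), (80, 0)]
slow_track = [(70, 0), (72, 0), (76, 0), (80, 0)]
victim_box = (75, -20, 120, 20)
print(is_strike(fast_track, victim_box, fps=25, speed_threshold=600))
print(is_strike(slow_track, victim_box, fps=25, speed_threshold=600))
```

Because the check works on the 2D projection only, it inherits the limitation stated in requirement #4: a strike directed along the camera axis may not register as a fast motion.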
3. CCTV: Falling Detection
Detect:
1. Person falling down from a punch.
2. Person on the ground being kicked or beaten up.
3. Person lying on the ground.
Requirements:
#1. Clearly visible standing person
#2. Clearly visible lying person
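One simple way to tell a standing person from a lying one, and to flag the transition between the two, is the aspect ratio of the person's bounding box. The sketch below is an assumption for illustration, not the method used in the slides: the box format (x0, y0, x1, y1), the ratio thresholds, `max_frames` and the function names `posture` and `detect_fall` are all hypothetical.

```python
def posture(box, standing_ratio=1.5, lying_ratio=0.7):
    """Label a box (x0, y0, x1, y1) by its height/width ratio."""
    x0, y0, x1, y1 = box
    width, height = x1 - x0, y1 - y0
    if width <= 0 or height <= 0:
        return "unknown"
    ratio = height / width
    if ratio >= standing_ratio:
        return "standing"
    if ratio <= lying_ratio:
        return "lying"
    return "transition"

def detect_fall(boxes, max_frames=15):
    """Return the frame index where a fall completes, or None.

    A fall is a change from 'standing' to 'lying' within `max_frames` frames.
    """
    last_standing = None
    for i, box in enumerate(boxes):
        label = posture(box)
        if label == "standing":
            last_standing = i
        elif label == "lying" and last_standing is not None:
            if i - last_standing <= max_frames:
                return i
            # Too slow to count as a fall (e.g. lying down deliberately).
            last_standing = None
    return None

standing = (0, 0, 40, 100)
halfway = (0, 30, 70, 100)
lying = (0, 70, 100, 100)
print(detect_fall([standing, standing, halfway, lying]))
```

The ratio test depends on both requirements above: if the standing or the lying person is not clearly visible, the box is unreliable and so is the label.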
4. False Depth Perception
Fist is in front; head is far behind.
People at a similar angular position relative to the camera but at different distances can look very close, or even touching, in the RGB image. In this case, if one of them moves fast (dancing, rotating, etc.), the other is not influenced by those movements.
So we should analyse the correlation between the movement intensity of people who appear close to each other in the RGB image, and filter out false positives when their movements are independent.
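The correlation check described above could look roughly like this. It is a minimal sketch under assumptions: each person's movement intensity is taken to be one number per frame (for example, mean keypoint displacement), the Pearson coefficient is used as the correlation measure, and `min_corr` and the function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def is_independent(intensity_a, intensity_b, min_corr=0.3):
    """True if the two movement-intensity series are only weakly correlated."""
    return abs(pearson(intensity_a, intensity_b)) < min_corr

# A dancer next to a motionless bystander: candidate false positive.
dancer = [5, 9, 4, 8, 6, 9]
bystander = [1, 1, 1, 1, 1, 1]
# An attacker and a victim who reacts to the hit.
attacker = [1, 2, 9, 3, 1, 1]
victim = [1, 1, 8, 6, 2, 1]
print(is_independent(dancer, bystander))
print(is_independent(attacker, victim))
```

A real reaction may lag the strike by a few frames, so a practical version would probably compare the series at several time offsets; that is left out here.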
5. Occluded Participant
Frames 1, 2: Normal behaviour with good visibility
Frame 3: Hit while person is occluded
Frame 4: Fall while person is occluded
6. Occluded Hit
Frame 1: Normal behaviour with good visibility
Frames 2, 3, 4: Occluded hit
7. Power Standoff Without Fast Movements
No single frame contains a strike
The frame sequence contains no fast motion
8. False Hit
A friendly hug or a pat on the shoulder can be fast and even forceful.
The difference from a power struggle lies in the manner of movement, which is a combination of movements of various parts of the body.
9. False Grassing
Many (perhaps most) falls are caused not by blows but by ridiculous accidents.
10. Standing Point Lower or Higher Than Ground Level
Impossible to detect falling relative to ground level.
Problems in full body position detection.
11. Fighting in the Crowd
Large number of people in the field of view
Mutual occlusion and chaotic movement
Performance problems
12. Review of Existing Solutions
Group 1: Instant frame classification:
● Body position classification
Problem: lots of false positives
● Motion as smoothed areas classification
Group 2: Motion tracking in frame sequence:
● Optical flow for motion estimation and classification
Problem: frame rate dependency
Group 3: Body matching in frame sequence:
● Body parts detection and matching
● Motion sequence classification
13. Used Approach Step 1: Pose Estimation and Analytical Motion
Pose estimation: detect keypoints and connections. Challenges:
● Closely located persons with body intersections
● Clothing on the body
● Hidden/occluded body parts
● Crowded scenes
Multiframe body matching and action classification
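Multiframe body matching needs some rule for deciding which pose in the current frame belongs to which person from the previous frame. The sketch below uses greedy nearest-centroid matching, which is an assumption chosen for brevity and not necessarily the matching used in this system. A pose is assumed to be a list of (x, y) keypoints with `None` for occluded ones; `match_poses` and `max_dist` are hypothetical names.

```python
import math

def centroid(pose):
    """Mean position of the visible keypoints, or None if all are occluded."""
    points = [p for p in pose if p is not None]
    if not points:
        return None
    return (sum(x for x, _ in points) / len(points),
            sum(y for _, y in points) / len(points))

def match_poses(prev, curr, max_dist=50.0):
    """Greedily pair poses of two frames by centroid distance.

    Returns {index_in_prev: index_in_curr}; poses farther apart than
    `max_dist` pixels, or fully occluded, stay unmatched.
    """
    candidates = []
    for i, prev_pose in enumerate(prev):
        prev_c = centroid(prev_pose)
        if prev_c is None:
            continue
        for j, curr_pose in enumerate(curr):
            curr_c = centroid(curr_pose)
            if curr_c is None:
                continue
            dist = math.dist(prev_c, curr_c)
            if dist <= max_dist:
                candidates.append((dist, i, j))
    matches, used = {}, set()
    for dist, i, j in sorted(candidates):  # closest pairs first
        if i not in matches and j not in used:
            matches[i] = j
            used.add(j)
    return matches

frame_a = [[(10, 10), (12, 30)], [(200, 10), (202, 30)]]
frame_b = [[(205, 12), (207, 32)], [(14, 11), None]]  # detection order swapped
print(match_poses(frame_a, frame_b))
```

Centroid distance alone is likely to confuse closely located persons with body intersections, one of the challenges listed above, so a fuller version would compare individual keypoints or appearance as well.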
15. Data Representativeness and Accuracy
Dataset variability
Static features: body parts, primitives
Dynamics: motion matching, speed estimation
Dataset ground truth: action classification
16. Challenges
Dataset quality of public artificial data:
● Slow hits
● Deceleration before hitting
● Fighting is only dynamics
● Poor action list scenario
● No ground truth
Dataset representativeness:
● No touches
● No falling
● Limited variability:
○ environment
○ no crowd
○ person’s appearance
17. Challenges and Uncertainties
● Smoothed motion
● Occluded strike
● Spatial orientation estimation
● Performance improvement: GPU parallelism, multiple models serving, intelligent preprocessing
● Voting system
● Dataset mining and labeling, request for proprietary datasets
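The slides do not say how the voting system would work. One possible reading, given the mention of serving multiple models, is a majority vote over the labels the models (or consecutive frames) produce. The sketch below illustrates only that reading; `majority_vote`, `min_share` and the label strings are hypothetical.

```python
from collections import Counter

def majority_vote(labels, min_share=0.5):
    """Return the label whose share of the votes exceeds `min_share`, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_share else None

# Three hypothetical models vote on the same frame sequence.
print(majority_vote(["fight", "fight", "normal"]))
print(majority_vote(["fight", "normal"]))
```

Returning `None` when no label has a clear majority leaves room for a separate policy on undecided cases, such as deferring to a human operator.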