We contribute a scalable, open source implementation [1] of the Pooled Time Series (PoT) algorithm from CVPR 2015. The algorithm is evaluated on approximately 6800 human trafficking (HT) videos collected from the deep and dark web, and on an open dataset: the Human Motion Database (HMDB). We describe PoT, our motivation for using it on larger data, and the issues we encountered. Our new solution reimagines PoT as an Apache Hadoop-based algorithm. We demonstrate that our new Hadoop-based algorithm successfully identifies similar videos in the HT and HMDB datasets, and we evaluate the algorithm qualitatively and quantitatively.
Scalable Hadoop-Based Pooled Time Series of Big Video Data from the Deep Web
1. Information Retrieval
and Data Science
Paul Ramirez
paul.m.ramirez@jpl.nasa.gov
Madhav Sharan
msharan@usc.edu
ICMR 2017, Bucharest
Dr. Chris Mattmann
mattmann@usc.edu
https://github.com/USCDataScience/hadoop-pot
ABOUT
Information Retrieval and Data Science (IRDS) Group
University of Southern California, Los Angeles, CA https://irds.usc.edu
Dr. Chris Mattmann: Director, IRDS; Chief Scientist, JPL
Madhav Sharan: Graduate Student, IRDS/JPL
Computer Science for Data Intensive Applications Group
Jet Propulsion Laboratory, Pasadena, CA
Paul Ramirez: Group Supervisor, JPL
OUTLINE
1. Introduction
2. Dataset
3. Hadoop PoT
4. Evaluation
5. Video Space
6. Thanks
INTRODUCTION
• AIM: create a scalable approach for calculating similarity between all pairs in a
set of videos
• Builds on the Pooled Time Series (PoT) algorithm presented at CVPR 2015 by
Dr. Michael Ryoo
• We present our datasets and our video-similarity use case, then our journey scaling
the algorithm on Hadoop
HUMAN TRAFFICKING DATASET
Human trafficking (HT) videos were crawled from online escort ads on
backpage.com
1. TOTAL SIZE - 26 GB
2. TOTAL VIDEOS - 6805
3. AVERAGE VIDEO SIZE - 3.8 MB
4. TOTAL RECORDING LENGTH ≈ 2250 min (6805 videos × 19.8 s average)
5. AVERAGE RECORDING LENGTH = 19.8 secs
HMDB DATASET
HMDB: a large Human Motion Database, open-sourced by the Serre Lab
1. TOTAL SIZE - 1.9 GB
2. TOTAL VIDEOS - 7,000
3. AVERAGE VIDEO SIZE ≈ 0.5 MB
4. TOTAL RECORDING LENGTH ≈ 350 min (7,000 videos × 3.1 s average)
5. AVERAGE RECORDING LENGTH = 3.1 secs
This open source, labeled dataset is used to evaluate the similarity algorithm.
SIMILARITY ALGORITHM
1. Enumerate all possible pairs of videos in the set
2. For each pair - Calculate mean distance
a. Calculate HOF (histogram of optical flow) and HOG (histogram of oriented gradients) for both videos using OpenCV, or load them from the cache. Cache newly computed HOF and HOG
b. Calculate the pooled time series feature for both videos
3. For each pair - Calculate chi-squared similarity
a. Use cached HOF and HOG
b. Calculate the pooled time series feature for both videos
c. Use the mean distance and both series to calculate a similarity score for the pair
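The two passes above can be sketched in a few lines of Python. This is a minimal illustration, not the hadoop-pot code: the function names, the `features` dictionary of precomputed pooled feature vectors, and the kernel form exp(-d / mean distance) are assumptions chosen to show how a mean distance from the first pass can normalise a chi-squared score in the second.

```python
import math
from itertools import combinations


def chi_squared_distance(a, b):
    """Chi-squared distance between two non-negative feature vectors."""
    total = 0.0
    for x, y in zip(a, b):
        if x + y > 0:  # skip bins that are empty in both vectors
            total += (x - y) ** 2 / (x + y)
    return total


def pairwise_similarity(features):
    """Return {(video_a, video_b): similarity} for every unordered pair.

    `features` is a hypothetical dict mapping a video name to its
    pooled time series feature vector (assumed already computed).
    """
    pairs = list(combinations(sorted(features), 2))
    if not pairs:
        return {}

    # Pass 1: distance for every pair, then the mean over all pairs.
    distances = {
        (a, b): chi_squared_distance(features[a], features[b])
        for a, b in pairs
    }
    mean_distance = sum(distances.values()) / len(distances)
    if mean_distance == 0:
        return {pair: 1.0 for pair in pairs}  # all videos are identical

    # Pass 2: assumed kernel form, normalised by the mean distance.
    return {
        pair: math.exp(-d / mean_distance) for pair, d in distances.items()
    }


if __name__ == "__main__":
    toy = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
    for pair, score in pairwise_similarity(toy).items():
        print(pair, round(score, 4))
```

Identical feature vectors score 1.0 under this kernel, and scores fall toward 0 as the distance grows relative to the mean. Both passes touch every pair, which is why the work grows quadratically with the number of videos.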
PROBLEMS
1. Out of memory (OoM) issues
2. Time-consuming sequential code
3. Instrumentation and checkpointing
4. Could only process 500 videos in 2 days
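To give a sense of scale for the all-pairs step, the short sketch below counts unordered video pairs for the set sizes mentioned in these slides. The helper name `pair_count` is hypothetical; the counts are plain combinatorics (n choose 2) and say nothing about per-pair cost.

```python
import math


def pair_count(n_videos):
    """Number of unordered pairs among n_videos videos (n choose 2)."""
    return math.comb(n_videos, 2)


if __name__ == "__main__":
    for n in (500, 6805, 7000):
        print(f"{n} videos -> {pair_count(n):,} pairs")
```

Going from 500 videos to the full HT set multiplies the number of pairs by roughly 185, so a sequential all-pairs loop does not scale linearly with the number of videos.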
OBSERVED RUNTIME
Total time for all Hadoop jobs:
HT: 33.18 hours
HMDB: 26.84 hours
[Runtime charts; annotations: "Time difference as per video length" and "Similar time for different video length"]
QUALITATIVE EVALUATION
1. Fetch the top 5 most similar videos according to PoT
2. Record the number of those videos with the same label (True)
3. Recall = True / Total
4. Every label had its highest recall for its own label
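The measurement above could be computed as in the following sketch. It assumes that "Total" means the number of retrieved videos (at most 5), which the slide does not state explicitly. The function name and the `similarities` and `labels` structures are hypothetical.

```python
def top_k_same_label_fraction(query, similarities, labels, k=5):
    """Fraction of the k most similar videos that share the query's label.

    similarities: dict mapping query -> {other_video: similarity score}
    labels: dict mapping video -> label
    Assumes "Total" is the number of videos actually retrieved.
    """
    scores = similarities[query]
    # Rank the other videos by descending similarity and keep the top k.
    ranked = sorted(
        (video for video in scores if video != query),
        key=lambda video: scores[video],
        reverse=True,
    )[:k]
    if not ranked:
        return 0.0
    true_count = sum(1 for video in ranked if labels[video] == labels[query])
    return true_count / len(ranked)


if __name__ == "__main__":
    sims = {"q": {"a": 0.9, "b": 0.8, "c": 0.1}}
    labels = {"q": "run", "a": "run", "b": "jump", "c": "run"}
    print(top_k_same_label_fraction("q", sims, labels, k=2))
```

Averaging this fraction over all query videos with a given label yields a per-label score that can be compared across labels.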
FUTURE WORK
1. Preprocessing videos
a. Removing banners at the start of a video
b. Dividing a video into a set of scenes
2. Adding convolutional features to enable object recognition, etc.; HOF and HOG
are too simple
THANK YOU
Questions/Comments?
Madhav Sharan
msharan@usc.edu
@goyal_madhav
@smadha
Dr. Chris Mattmann
mattmann@usc.edu
@chrismattmann
@chrismattmann
https://github.com/USCDataScience/hadoop-pot