Code and proposed experimentation setup for the paper End-to-End Semantic Video Transformer for Zero-Shot Action Recognition. The annotated class descriptions can be found in the Annotations folder.
We pool valid test classes from several benchmark datasets to form a novel test set. Altogether, there are 30 unique classes drawn from the UCF-101, HMDB-51, and ActivityNet datasets, as listed in the table below. Each class is handpicked so that it does not violate the zero-shot learning (ZSL) premise; a minimal sketch of how such a test split can be assembled appears after the table.
| Dataset | Class |
|---|---|
| UCF | Pizza Tossing |
| UCF | Ice Dancing |
| UCF | Handstand Walking |
| UCF | Handstand Pushup |
| UCF | Mixing |
| UCF | Wall Pushups |
| UCF | Horse Race |
| UCF | Playing Dhol |
| HMDB | Draw Sword |
| HMDB | Sword Exercise |
| HMDB | Chew |
| ActivityNet | Applying sunscreen |
| ActivityNet | Beach soccer |
| ActivityNet | Cleaning shoes |
| ActivityNet | Cleaning sink |
| ActivityNet | Cutting the grass |
| ActivityNet | Doing karate |
| ActivityNet | Doing kickboxing |
| ActivityNet | Drinking beer |
| ActivityNet | Drinking coffee |
| ActivityNet | Fun sliding down |
| ActivityNet | Hand car wash |
| ActivityNet | Making an omelette |
| ActivityNet | Painting fence |
| ActivityNet | Playing water polo |
| ActivityNet | River tubing |
| ActivityNet | Snow tubing |
| ActivityNet | Starting a campfire |
| ActivityNet | Washing face |
| ActivityNet | Washing hands |
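
As a rough illustration (not the exact scripts in this repository), the pooled test split above can be assembled from per-dataset class lists. The dataset roots, folder layout, and helper name below are hypothetical placeholders.

```python
# Sketch: assemble the pooled 30-class zero-shot test split.
# The dataset roots and one-subfolder-per-class layout are hypothetical assumptions.
from pathlib import Path

TEST_CLASSES = {
    "UCF": ["Pizza Tossing", "Ice Dancing", "Handstand Walking", "Handstand Pushup",
            "Mixing", "Wall Pushups", "Horse Race", "Playing Dhol"],
    "HMDB": ["Draw Sword", "Sword Exercise", "Chew"],
    "ActivityNet": ["Applying sunscreen", "Beach soccer", "Cleaning shoes", "Cleaning sink",
                    "Cutting the grass", "Doing karate", "Doing kickboxing", "Drinking beer",
                    "Drinking coffee", "Fun sliding down", "Hand car wash", "Making an omelette",
                    "Painting fence", "Playing water polo", "River tubing", "Snow tubing",
                    "Starting a campfire", "Washing face", "Washing hands"],
}

def collect_test_videos(dataset_roots):
    """Gather (video_path, dataset, class_name) tuples for the pooled test set.

    `dataset_roots` maps a dataset name to a directory containing one subfolder per class.
    """
    samples = []
    for dataset, classes in TEST_CLASSES.items():
        root = Path(dataset_roots[dataset])
        for cls in classes:
            for video in sorted((root / cls).glob("*.mp4")):
                samples.append((video, dataset, cls))
    return samples
```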
We next explain the rationale for excluding overlapping classes and completely irrelevant classes from the proposed test set.
In the figure below, we visualize the semantic embeddings of the classes in the Kinetics, ActivityNet, and UCF-101 datasets. Several classes in each test dataset directly overlap with the training dataset (Kinetics), which violates the ZSL paradigm.
Specifically, the overlapping classes between UCF-101 and Kinetics-600/700 are visualized below. The Kinetics classes shown in green are flagged as overlapping by thresholding the cosine distance between Word2Vec embeddings, and are removed in B. Brattoli et al., "Rethinking zero-shot video classification: End-to-end training for realistic applications", CVPR 2020. The Kinetics classes shown in red are the semantically nearest classes to the corresponding UCF-101 class according to Word2Vec, yet are not removed in the Brattoli et al. paper. In several cases, the actual closest Kinetics classes, shown in blue, are missed by Word2Vec altogether; they are almost identical to the corresponding UCF-101 classes and thus violate the ZSL premise.
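
As a hedged sketch of this distance-based overlap check, the snippet below flags Kinetics classes whose Word2Vec embedding lies within a cosine-distance threshold of a UCF-101 class embedding. The 0.25 threshold and the pre-computed embedding dictionaries are illustrative assumptions, not the exact procedure of Brattoli et al.

```python
# Sketch: flag training (Kinetics) classes that semantically overlap with test (UCF-101) classes.
# Assumes `kinetics_emb` and `ucf_emb` map class names to pre-computed Word2Vec vectors
# (e.g., the average of the per-word embeddings of a class name); the threshold is illustrative.
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def find_overlaps(kinetics_emb, ucf_emb, threshold=0.25):
    """Return (kinetics_class, ucf_class, distance) triples below the distance threshold."""
    overlaps = []
    for k_cls, k_vec in kinetics_emb.items():
        for u_cls, u_vec in ucf_emb.items():
            d = cosine_distance(k_vec, u_vec)
            if d < threshold:
                overlaps.append((k_cls, u_cls, d))
    return sorted(overlaps, key=lambda t: t[2])
```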
In the figure below, we break down the performance of the proposed model over all the classes in the UCF dataset (i.e., not only those included in the proposed test set). For several classes, such as Nunchucks, YoYo, and UnevenBars, the proposed approach fails to classify even a single video correctly. This is not a shortcoming of the proposed method, but a consequence of the sheer dissimilarity of these classes from the training classes in the Kinetics dataset. Since any practical algorithm will miss such classes, this underscores the need to remove classes that are completely irrelevant to the training set from the test set.
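
The per-class breakdown can be reproduced with a simple accuracy-by-class computation; the sketch below assumes arrays of ground-truth and predicted class indices plus a list of class names, which are assumed inputs rather than files in this repository.

```python
# Sketch: per-class accuracy over all UCF classes (not only the proposed test split).
# Assumes `y_true` and `y_pred` are integer class indices and `class_names[i]` names class i.
import numpy as np

def per_class_accuracy(y_true, y_pred, class_names):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accs = {}
    for idx, name in enumerate(class_names):
        mask = y_true == idx
        if mask.any():
            accs[name] = float((y_pred[mask] == idx).mean())
    # Classes stuck at 0.0 accuracy are candidates for being "irrelevant to the training set".
    return accs
```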
Thanks to the scalability of the proposed SVT (semantic video transformer) model, we can vary the length of the input video snippet (i.e., the number of frames), which in turn increases the number of input tokens. In Table 3 of the paper, we see a significant increase in performance when the number of input frames is increased from 8 to 96. The gain is intuitive: longer inputs allow the model to better capture spatiotemporal activities that span many frames. Due to current GPU limitations, however, we are unable to increase the input length further. Even after increasing the model complexity to accommodate 96 input frames, our model remains more computationally efficient than the I3D model with 8 input frames: I3D requires 10.8 TFLOPS for inference, whereas the proposed SVT-8 requires only 0.79 TFLOPS and SVT-96 requires 7.57 TFLOPS.
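
For intuition on why longer clips increase the token count (and hence compute), the sketch below counts input tokens under a standard tubelet/patch tokenization. The 224x224 resolution, 16x16 spatial patches, and tubelet size of 2 frames are illustrative assumptions, not the exact SVT configuration.

```python
# Sketch: how the number of input tokens grows with clip length under tubelet tokenization.
# Resolution, patch size, and tubelet size below are assumptions for illustration only.
def num_tokens(frames, height=224, width=224, patch=16, tubelet=2):
    spatial = (height // patch) * (width // patch)   # tokens per temporal group
    temporal = frames // tubelet                      # temporal groups along the clip
    return spatial * temporal

for f in (8, 96):
    print(f"{f:3d} frames -> {num_tokens(f):5d} tokens")
# Self-attention cost grows roughly quadratically with the token count,
# which is why GPU memory bounds the maximum clip length.
```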


