# Action Recognition in Video using Two-Stream Deep Learning Architecture

# Abstract

The action recognition task involves identifying different actions in videos, where the action may or may not be performed throughout the entire duration of the video.

Classifying actions (jumping, picking, playing, etc.) in still images has been tackled with high accuracy using deep convolutional networks. For action recognition in videos, however, existing work is limited and far less accurate.

Action recognition in videos requires capturing context from the whole video rather than just the information in each individual frame.

# Literature Review

We aim to extend the deep convolutional neural network models used for classifying actions in still images. Prior work feeds stacks of consecutive frames directly into a CNN, but the results are poor. To improve on this, we implement an architecture for action recognition in videos that is inspired by the human visual cortex, which uses two streams to detect and recognise motion:

- Ventral Stream: performs object recognition
- Dorsal Stream: recognises motion

The authors' architecture likewise involves two separate recognition streams, spatial and temporal. The spatial stream performs action recognition on still frames extracted from the video, while the temporal stream recognises actions from motion using dense optical flow.

# Architecture

We can decompose a video into two components, spatial and temporal. Each stream is implemented as a deep CNN:

- Spatial Stream ConvNet: carries information about the objects and the scene in the video, in RGB format. It operates on individual frames extracted from the video at some sampling rate.
- Temporal Stream ConvNet: carries information about the movement of the object (or the camera) in the scene, i.e. the direction of motion. Its input is formed by stacking the optical flow displacement fields between consecutive video frames, as in the sketch below.
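
As a minimal sketch of how this stacked input can be formed, the snippet below uses OpenCV's Farneback dense optical flow; the choice of flow estimator, the stack length `L`, and the `flow_stack` helper are our illustrative assumptions, not a fixed part of the architecture.

```python
# Build the temporal-stream input: a (2L, H, W) stack of horizontal and
# vertical optical-flow displacement fields from L consecutive frame pairs.
import cv2
import numpy as np

def flow_stack(frames, L=10):
    """frames: list of at least L+1 grayscale images of equal size."""
    channels = []
    for i in range(L):
        # Farneback dense flow between frames i and i+1; positional args:
        # (pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags)
        flow = cv2.calcOpticalFlowFarneback(
            frames[i], frames[i + 1], None, 0.5, 3, 15, 3, 5, 1.2, 0)
        channels.append(flow[..., 0])  # horizontal displacement
        channels.append(flow[..., 1])  # vertical displacement
    return np.stack(channels, axis=0)
```
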
# Dataset

We are using the well-known UCF101 dataset of human actions for our project. UCF101 has 101 classes, which are divided into five types:

- Human-Object Interaction
- Body-Motion Only
- Human-Human Interaction
- Playing Musical Instruments
- Sports

The dataset consists of realistic user-uploaded videos with cluttered backgrounds and camera motion. It offers the largest diversity in terms of actions, together with large variations in camera motion, background clutter, and illumination conditions.
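
As a hedged sketch, the dataset can be loaded with torchvision's built-in UCF101 wrapper, assuming the official videos and train/test split files have been downloaded; the paths and clip settings below are illustrative only.

```python
# Load UCF101 clips with torchvision (requires the videos and the
# official train/test split files on disk, plus the PyAV backend).
from torchvision.datasets import UCF101

train_set = UCF101(
    root="data/UCF-101",                      # directory of .avi videos
    annotation_path="data/ucfTrainTestlist",  # official split files
    frames_per_clip=16,
    step_between_clips=16,
    train=True,
)
video, audio, label = train_set[0]  # video: (T, H, W, C) uint8 tensor
print(video.shape, label)
```
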
# Approach

The motivation is to capture both the spatial features of the video and the temporal ones. For example, to classify whether the person in a video is doing archery or playing basketball, our spatial network captures the still-frame information about the action being performed; essentially, it performs a classification task on individual video frames. The temporal network tries to distinguish the action using the motion in the video. For this motion we use segmentation-based optical flow in both the horizontal and vertical directions. Both models use ResNet34 as the underlying network.

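A minimal sketch of this two-stream setup with ResNet34 backbones follows; the 2L-channel first convolution for the temporal stream and the late fusion by averaging class scores are our assumptions for illustration, not necessarily the exact training configuration.

```python
# Two-stream model with ResNet34 backbones: the spatial stream takes a
# 3-channel RGB frame, the temporal stream a 2L-channel flow stack.
import torch
import torch.nn as nn
from torchvision.models import resnet34

def make_stream(in_channels, num_classes=101):
    net = resnet34()
    if in_channels != 3:
        # Replace the first convolution to accept the stacked flow input.
        net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7,
                              stride=2, padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

L = 10
spatial = make_stream(3)       # RGB frame
temporal = make_stream(2 * L)  # stacked horizontal/vertical flow fields

frame = torch.randn(1, 3, 224, 224)
flows = torch.randn(1, 2 * L, 224, 224)
# Late fusion: average the two streams' class probabilities.
scores = (spatial(frame).softmax(1) + temporal(flows).softmax(1)) / 2
print(scores.argmax(1))
```
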
# Team Members
- [Saurabh Kumar](https://github.com/skwow)
- [Kaustav Vats](https://github.com/kaustavvats)
- [Manish Mahalwal](https://github.com/mahalwal)