ozyurtf/continual-learning

Exploring how to develop a model that jointly predicts motion (optical flow) and future frames in a sequential manner from the same hidden state.


Note: Currently, the models are cheating: they memorize the past frame(s) and optical flow(s) and return those as the predictions of the next video frame(s) and optical flow(s). I am currently working on fixing this issue.

Next Frame(s) Prediction

[GIF: Next Frame(s) Prediction]

Optical Flow(s) Prediction

[GIF: Optical Flow(s) Prediction]

Data

The full CLEVRER dataset can be downloaded from http://clevrer.csail.mit.edu.
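For convenience, the sketch below shows one way to read frames from a downloaded video. The use of OpenCV and the function name load_frames are illustrative assumptions, not part of this repository.

import cv2  # pip install opencv-python

def load_frames(video_path, num_frames=128):
    # Read up to num_frames frames from a CLEVRER .mp4 video and
    # convert them from OpenCV's BGR channel order to RGB.
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames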

Training

The training process is summarized in the figures below.

[Figure: Flow Reconstruction]    [Figure: Image Reconstruction]

Pipeline

[Figure: Pipeline]
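For readers viewing this without the figures, the sketch below illustrates the core structure described above: a single LSTM hidden state that feeds both a frame head and a flow head. The module names, the head shapes, and the linear stand-in for CLIP's image encoder are assumptions for illustration, not the repository's actual model code.

import torch
import torch.nn as nn

class JointPredictor(nn.Module):
    # Sketch: one recurrent hidden state drives both prediction heads.
    def __init__(self, embed_dim=512, hidden_size=512, frame_dim=3 * 224 * 224):
        super().__init__()
        # Linear stand-in for CLIP's image encoder (illustrative only).
        self.encoder = nn.Linear(frame_dim, embed_dim)
        self.cell = nn.LSTMCell(embed_dim, hidden_size)
        self.frame_head = nn.Linear(hidden_size, frame_dim)     # next frame (flattened)
        self.flow_head = nn.Linear(hidden_size, 2 * 224 * 224)  # (dx, dy) per pixel

    def forward(self, frames):
        # frames: (batch, time, frame_dim) flattened images.
        b, t, _ = frames.shape
        h = frames.new_zeros(b, self.cell.hidden_size)
        c = frames.new_zeros(b, self.cell.hidden_size)
        pred_frames, pred_flows = [], []
        for i in range(t):
            h, c = self.cell(self.encoder(frames[:, i]), (h, c))
            pred_frames.append(self.frame_head(h))  # both heads read the same h
            pred_flows.append(self.flow_head(h))
        return torch.stack(pred_frames, dim=1), torch.stack(pred_flows, dim=1)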

After installing the libraries listed in requirements.txt, training can be started with the following command:

python train.py \
    --num_predictions 3 \
    --embed_dim 512 \
    --hidden_size 512 \
    --stride 1 \
    --num_frames 127 \
    --resize_img 224 \
    --patch_size 32
  • num_predictions specifies the number of predictions made in each step. For example, if set to 4, the next 4 frames, optical flows, and states are predicted in the current step. The visualizations of the frame predictions and optical flow predictions are saved into the flows and frames folders for each video separately.
  • embed_dim specifies the embedding dimension for CLIP's image encoder.
  • hidden_size specifies the size of the hidden state for the LSTM cell.
  • stride specifies the interval between predictions. For instance, if the stride is set to 4 and the number of predictions to 3, the 5th, 9th, and 13th frames and the optical flows between the 1st-5th, 5th-9th, and 9th-13th frames are predicted in the first step. In the next step, the 9th, 13th, and 17th frames and the optical flows between the 5th-9th, 9th-13th, and 13th-17th frames are predicted, and so on (see the sketch after this list).
  • num_frames specifies the number of frames used to train the model. Each video contains 128 frames.
  • resize_img specifies the target dimensions of the images before extracting features with CLIP's image encoder.
  • patch_size specifies the size of the patches used to process images in CLIP's image encoder.
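To make the stride arithmetic above concrete, the hypothetical helper below reproduces the indexing from the stride example (frame indices are 1-based). It is illustrative and not taken from train.py.

def prediction_targets(step, stride, num_predictions):
    # Frame indices and flow intervals predicted at a given (1-indexed) step.
    start = 1 + (step - 1) * stride
    frames = [start + stride * k for k in range(1, num_predictions + 1)]
    flows = [(start + stride * k, start + stride * (k + 1))
             for k in range(num_predictions)]
    return frames, flows

print(prediction_targets(1, 4, 3))  # ([5, 9, 13], [(1, 5), (5, 9), (9, 13)])
print(prediction_targets(2, 4, 3))  # ([9, 13, 17], [(5, 9), (9, 13), (13, 17)])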

All of these parameters are optional, and the code can also be run with the simpler command:

python train.py
