Adding an example on video classification #478
Thanks for the PR. This is great stuff! Do you have a notebook version so we can take a look at the visualizations?
Thank you, @fchollet, for the review. I have addressed all your comments. Here's a Colab Notebook for the visualizations.
Thanks for the update!
@fchollet, I tried to incorporate your suggestions, but, weirdly, it affects the performance quite a bit. I am unsure why this might be the case. Would you be able to take a look at the notebook?
I tried running the notebook with a few simplifications (sketched below):
- apply pooling in the feature extractor to limit the number of features
- simplify the classification model
- shuffle the training data and use a validation split to monitor overfitting
Notebook: https://colab.research.google.com/drive/1sfkoKxEF_kfGU1vi-_rNOQeu_Ydx1EtF?usp=sharing
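Concretely, the kind of thing I mean looks roughly like this. This is only an illustrative sketch, not the notebook's code: the InceptionV3 backbone, the layer sizes, and the random stand-in arrays are placeholders.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

IMG_SIZE = 224
MAX_SEQ_LENGTH = 20
NUM_CLASSES = 5

# 1. Pooling in the feature extractor: each frame becomes a single feature
#    vector instead of a large spatial feature map.
feature_extractor = keras.applications.InceptionV3(
    weights="imagenet",
    include_top=False,
    pooling="avg",
    input_shape=(IMG_SIZE, IMG_SIZE, 3),
)
NUM_FEATURES = feature_extractor.output_shape[-1]  # 2048 for InceptionV3

# 2. A much smaller classification model on top of the per-frame feature sequences.
classifier = keras.Sequential(
    [
        keras.Input(shape=(MAX_SEQ_LENGTH, NUM_FEATURES)),
        layers.GRU(16),
        layers.Dropout(0.4),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ]
)
classifier.compile(
    optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)

# 3. Shuffle the training data and hold out a validation split to monitor overfitting.
#    Random arrays stand in for the real precomputed feature sequences.
train_features = np.random.rand(152, MAX_SEQ_LENGTH, NUM_FEATURES).astype("float32")
train_labels = np.random.randint(0, NUM_CLASSES, size=(152,))
order = np.random.permutation(len(train_features))
train_features, train_labels = train_features[order], train_labels[order]

classifier.fit(
    train_features, train_labels, validation_split=0.2, epochs=10, batch_size=16
)
```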
The results I'm seeing are consistent with what I'd expect given the small number of samples and the large model sizes: quick overfitting and low generalization.
To train a model with this many parameters (your classification model has ~10M parameters) you'd need tens of thousands of samples at the very least. Right now there are 152 training samples, so this is just impossible. I recommend both simplifying the model (e.g. some of the changes outlined above) and increasing the size of the training data by at least 10x.
If you get to 1000 training samples you can probably manage to train a classification model with 10-100k parameters.
train_labels = prepare_labels(train_labels)
test_labels = prepare_labels(test_labels)
I had not noticed that we had 152 training samples and 34 validation samples for 5 classes. This seems far from sufficient; you'd need at least 10x more.
Thank you for your inputs, @fchollet. I realized I had introduced a pesky bug while preparing the frame sequences, which was hurting the mapping between videos and labels. I took care of it and incorporated your suggestions.
With these changes, the performance is good enough. Here's the notebook. Additionally, your contributions to this would-be example are non-trivial IMHO; they go way beyond just code reviews and improvement suggestions. I would therefore very much welcome the idea of adding you as a co-author for the example. LMK.
Glad you were able to fix the bug! I've been thinking about this example some more. This would be our first video classification example, which is great. But that means it should also exemplify best practices -- techniques that will work well across most video classification problems, not just this specific dataset. Some things are already very good, like the general idea of using a CNN for feature extraction followed by an RNN. Others could be improved. Here's what I'd like to see in a generic video classification example:
There's no real justification for using exactly N frames per video. In the real world your videos will have different lengths. The justification "we need to batch images" doesn't hold, because you can just pad shorter videos with zeros and generate a padding mask, which the RNN will take into account (it will skip masked frames, so it won't even slow down training). How that would work in practice: use a batch size of 1 during feature extraction, then pad the vector sequences with zeros and add a mask input to the classification model, which the RNN layer would consume (see the code sketch further down).
5 frames is very low and doesn't justify the use of an RNN, since the importance of frame order will be extremely marginal. We need the inputs to be actual videos, not a small collection of pictures -- so more like 20+ frames.
As mentioned, at least a few hundred samples per class. If you have 5 balanced classes then 1000-2000 is a reasonable number.
Don't worry about it, I'm just the janitor here. Let me know.
Thanks!
1. Could you provide a minimal code example / relevant resource for this?
2. Alright.
3. Alright. But also note that, because of broken videos (videos that OpenCV could not capture), the number of available videos got reduced further.
I imagine you could do something like (untested code, just writing it up here):

```python
import numpy as np
from tensorflow.keras.layers import Input, GRU
from tensorflow.keras.models import Model

frame_masks = np.zeros(shape=(num_samples, max_seq_length), dtype="uint8")
frame_features = np.zeros(shape=(num_samples, max_seq_length, num_features), dtype="float32")

for i, batch in enumerate(dataset):  # dataset is batched with size 1
    length = batch.shape[1]  # this is different from video to video
    for j in range(length):
        # feature_extractor is just a CNN that returns a vector of features per frame
        frame_features[i, j, :] = feature_extractor(batch[:, j, :])
    frame_masks[i, :length] = 1  # 1 = not masked, 0 = masked

...  # later

frame_features_input = Input((max_seq_length, num_features))
mask_input = Input((max_seq_length,), dtype="uint8")
x = GRU(...)(frame_features_input, mask=mask_input)

...  # later

model = Model([frame_features_input, mask_input], output)
model.fit([frame_features, frame_masks], labels, ...)
```

Does that make sense? Can't guarantee it will work exactly as is, but it gives you an idea of what is supposed to happen.
@fchollet, I ended up developing a data loading utility for this; here's the full notebook. Additionally, given that the example now covers the best practices you mentioned above, is it possible to limit the number of samples to the current setting in the interest of the runtime (there will be extensive notes about the data regime, though)? Update: logs after training for 5 epochs:
I think this might be because you won't get multiprocessing on Colab. But regardless of the environment, I strongly recommend using a tf.data pipeline. A random thing you can try to diagnose the issue is to use …
Is there an obstacle to getting more data, or is the issue the runtime?
I tried it on a GCP VM (N1-standard-8) too but the performance was roughly the same.
I am more inclined toward creating a standard dataset for this, since we are trying to showcase a couple of best practices here. Besides, showing readers how to create a standard video classification dataset as part of the example would be beneficial too.
Runtime is currently the main issue. Additionally, if you could check the notebook in my previous comment and let me know if it's close to what you had envisioned, that would be helpful.
I checked out the notebook and the model looks great!
Do you mean an end-to-end tf.data dataset where you provide video filenames and labels and it gives you an iterable of encoded video frames, masks, and encoded labels? This is certainly doable. To avoid redundant feature extraction work you're going to want to use caching, though. I would encourage you to go this route (in particular to preserve TPU compatibility). Actually, this is a major issue with your current approach.
Then let's increase the data size and reduce the number of epochs (while documenting the performance achieved for the full number of epochs).
Okay, let me look into these. TPU compatibility is questionable if we want to extract the frames using OpenCV, though. I would do that using …
You could do this, but I would recommend doing the Python-only parts of the preprocessing beforehand, then using tf.data for the rest.
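As a rough, untested sketch of that flow -- the shapes and names below are placeholders, assuming the frame features and masks were already computed beforehand, as in the earlier snippet:

```python
import numpy as np
import tensorflow as tf

# Placeholders for arrays prepared beforehand with the Python-only preprocessing
# (OpenCV decoding + per-frame CNN feature extraction).
num_samples, max_seq_length, num_features = 594, 20, 2048
frame_features = np.zeros((num_samples, max_seq_length, num_features), dtype="float32")
frame_masks = np.zeros((num_samples, max_seq_length), dtype="bool")
labels = np.zeros((num_samples,), dtype="int64")

dataset = (
    tf.data.Dataset.from_tensor_slices(((frame_features, frame_masks), labels))
    .cache()  # the expensive work already happened once, up front
    .shuffle(num_samples)
    .batch(16)
    .prefetch(tf.data.AUTOTUNE)
)

# Each batch is ((features, masks), labels), which a two-input Functional model
# can consume directly via model.fit(dataset), with no Python generator in the loop.
```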
This sounds like what we were doing previously: reading the videos and capping their frames at a predefined maximum. Update: with the current dataset (594 videos in the train set and 224 in the test set), it takes a total of 14 minutes and 37 seconds to fully prepare the arrays beforehand. Another point to note is that I did this in the "High-RAM" setting of Colab Pro; I am not sure the runtime will be the same for non-Pro Colab users. However, training for 5 epochs is now extremely fast. Here's the full notebook. The data preprocessing time will naturally increase if we increase the number of samples, and may actually go well beyond the permitted runtime. @fchollet, let me know how you'd want us to proceed from here.
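For reference, the "reading the videos beforehand" step looks roughly like this; the function name, frame cap, and resize size below are illustrative rather than the notebook's exact code:

```python
import cv2
import numpy as np

def load_video(path, max_frames=20, resize=(224, 224)):
    """Read a video with OpenCV, keeping at most `max_frames` resized RGB frames."""
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while len(frames) < max_frames:
            ret, frame = cap.read()
            if not ret:  # end of video, or a broken file OpenCV cannot read
                break
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, ::-1]  # BGR (OpenCV) -> RGB
            frames.append(frame)
    finally:
        cap.release()
    return np.array(frames)  # shape: (num_frames, height, width, 3)
```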
Yes, indeed, and that strategy was completely fine.
Does it make sense in this case to attempt to parallelize the preprocessing? If most of the time is spent in OpenCV, and that code is already parallel, then there's not much we can do. Also, it seems we can increase the number of epochs if training is so fast. I think if we keep the total runtime below 30 minutes it will be a success. How many training samples would that get us?
I was going to suggest this, but the OpenCV backend utilities are already threaded, so there's not much we can do here.
Yes, totally.
A few, maybe 30 or 50. But that risks the runtime being overloaded by the preprocessing we are doing.
@fchollet, let me know how you'd want us to proceed from here.
@sayakpaul I'll let you make the call on how many samples to use -- more data is better, but we should try to keep the runtime below 30 min. Up to you at this point! A random thought: by using a smaller/faster CNN you'll be able to cut some processing time (though not by much if the time is mostly spent in OpenCV).
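To illustrate the size difference, here's a quick way to compare backbone parameter counts; the backbones named here are just examples, not necessarily what the notebook uses:

```python
from tensorflow import keras

# Compare the size of a heavier feature extractor against a lighter one.
for name, builder in [
    ("InceptionV3", keras.applications.InceptionV3),
    ("MobileNetV2", keras.applications.MobileNetV2),
]:
    backbone = builder(
        weights=None, include_top=False, pooling="avg", input_shape=(224, 224, 3)
    )
    print(f"{name}: {backbone.count_params():,} parameters")
```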
Thanks, @fchollet. Let me incorporate these points in the next commit. This PR has been an amazing learning experience for me. Thank you for pushing it forward :)
@fchollet, I added the changes to the example; here's the full notebook. Let me know your thoughts.
Fantastic! Everything is looking good. Please add the generated files 👍
Thanks for all your help and feedback, @fchollet. I have added the files. The visualization GIF should be embedded in the notebook, but let me know if you find anything off.
LGTM, thank you for the great contribution! I confirm the gif is being captured -- some guy shaving his beard.
Actually, there's a small issue -- it seems these copyedits were dropped. Can you please create a new PR to add them back? |
There are many subtleties to training a well-performing video classifier, and there are many ways to train one; this example walks through one of them. Hopefully it will be helpful to the community.