Hello,
I see in the paper that default MLLM configs were largely used, but frame counts were increased where applicable.
Some models, such as LongVA, appear to support video contexts of up to 1000 frames, but the benchmark uses only 128. If a model can handle the additional frames, using them could improve its performance.
What determines the frame count used for each model?