Description
Just writing to share my experience; perhaps it could help someone, and maybe the docs/requirements can be updated.
Requirements:
- Had to change `deepspeed==0.12.2` to `deepspeed==0.3.16`, because the previously pinned version wouldn't compile on Windows.
- Ended up running torch version `2.5.1+cu124`.
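For reference, the change amounts to a one-line edit (a hypothetical diff against the repo's requirements file; the exact pin in your copy may differ):

```diff
-deepspeed==0.12.2
+deepspeed==0.3.16
```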
Then had to make a few tweaks to the sample code to get it to run.
I have two GPUs, and for some reason it always tried to run on the one with less memory, so I had to fix the GPU usage by adding an explicit device in the sample code. I also loaded the model directly from Hugging Face instead of having to download it first, by specifying the correct model name:
```python
tokenizer, model, image_processor, context_len = load_pretrained_model(
    "jadechoghari/LongVU_Qwen2_7B",
    model_base=None,
    model_name="cambrian_qwen",
    device="cuda:0",
)
```
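An alternative (or complement) to passing `device="cuda:0"` is to hide the smaller card from CUDA entirely, so nothing can land on it by accident. A minimal sketch, assuming the larger card is device 0 as reported by `nvidia-smi`:

```python
import os

# Must be set before torch (or anything else that initializes CUDA) is
# imported; after this, the process only ever sees GPU 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # -> 1
```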
On 24 GB of VRAM, I found that I had to limit videos to about 1000 frames (around 30 seconds) for it to work, otherwise it ran out of memory. There might be a way to offload or quantize the model to handle more, but this was my dirty workaround when loading `frame_indices`:
```python
num_frames = min(len(vr), 1000)  # cap at 1000 frames to stay inside 24 GB of VRAM
frame_indices = np.arange(0, num_frames, round(fps))  # sample roughly one frame per second
```
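For context, this lives in the sample's decord-based video loading; here is a self-contained sketch of how it fits together (`video_path` is a placeholder, and the model-specific preprocessing that follows is omitted):

```python
import numpy as np
from decord import VideoReader, cpu

vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())

num_frames = min(len(vr), 1000)                       # the memory cap from above
frame_indices = np.arange(0, num_frames, round(fps))  # ~1 frame per second
frames = vr.get_batch(frame_indices).asnumpy()        # (N, H, W, 3) uint8 array
```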
Next, I had to pass an `attention_mask` to `generate()` (and also increased `max_new_tokens` so the description wasn't always cut off):
```python
attention_mask = torch.ones_like(input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        images=video,
        image_sizes=image_sizes,
        do_sample=True,
        temperature=0.2,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
```
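The generated ids can then be decoded to text with the standard transformers pattern:

```python
# Strip special tokens and surrounding whitespace to get the final description.
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```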
After that, it pretty much just ran, and it was quite quick too: less than a minute, with the majority of that spent loading the model (from a not-so-fast disk). It can definitely be used in near real time if needed.