Running on Windows with 24Gb VRAM #6

Closed
@ipeevski

Description

Just writing to share my experience; perhaps it could help someone, and maybe the docs/requirements could be updated.

Requirements:

  • Had to change the pinned deepspeed==0.12.2 to deepspeed==0.3.16 (the pinned version wouldn't compile on Windows)
  • Ended up running torch version 2.5.1+cu124

Then I had to make a few tweaks to the sample code to get it to run.
I have two GPUs, and for some reason it always tried to run on the one with less memory, so I had to pin the GPU explicitly.

I added an explicit device in the sample code. I also loaded the model directly from Hugging Face instead of downloading it first, by specifying the correct model name:

tokenizer, model, image_processor, context_len = load_pretrained_model(
    "jadechoghari/LongVU_Qwen2_7B",
    model_base=None,
    model_name="cambrian_qwen",
    device="cuda:0"
)
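An alternative to passing device="cuda:0" (my own workaround, not part of the LongVU sample) is to hide the unwanted GPU before anything CUDA-related is initialised, so the larger card is the only device PyTorch can see:

```python
import os

# Expose only physical device 0 (assumed here to be the larger-memory
# card; check nvidia-smi for your ordering). This must run before torch
# initialises CUDA for the first time, i.e. before the first CUDA call.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```

With this set, `cuda:0` inside the process always maps to the chosen physical GPU, so no other code needs changing.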

On 24 GB of VRAM, I found I had to limit videos to about 1000 frames (around 30 seconds) for it to work; otherwise it ran out of memory. There might be a way to offload or quantize the model to handle more, but this was my quick workaround when building frame_indices:

num_frames = min(len(vr), 1000)
frame_indices = np.arange(0, num_frames, round(fps))
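The same capping logic can be wrapped in a small helper (the function name and signature are mine, not part of the sample code):

```python
import numpy as np

def capped_frame_indices(total_frames: int, fps: float,
                         max_frames: int = 1000) -> np.ndarray:
    """Sample roughly one frame per second, never going past max_frames."""
    num_frames = min(total_frames, max_frames)
    step = max(1, round(fps))  # guard against fps < 0.5 producing step 0
    return np.arange(0, num_frames, step)
```

For a 90-frame clip at 30 fps this yields indices [0, 30, 60]; for a long video the indices never reach 1000, which is what kept the model within 24 GB for me.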

Next, I had to pass an attention_mask to generate() (and I also increased max_new_tokens so the description wasn't cut off):

attention_mask = torch.ones_like(input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        images=video,
        image_sizes=image_sizes,
        do_sample=True,
        temperature=0.2,
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )

After that, it pretty much just ran, and it was quite quick too: less than a minute end to end, most of which was loading the model from a not-so-fast disk. It could definitely be used in near real time if needed.
