
Inference on video without extracting images #641


Draft: wants to merge 24 commits into main
Conversation

cjaverliat

This PR proposes SAM2Generic, an alternative to the original SAM2Base that provides new APIs. Additionally, I added SAM2GenericVideoPredictor, a re-implementation of the video predictor with configurable strategies for memorizing and removing past memories (cf. here for an example), which solves the issue of keeping everything in VRAM.
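To illustrate the idea of a pluggable removal strategy, here is a rough, hypothetical sketch of a sliding-window eviction policy. The class name, method name, and signature below are illustrative assumptions, not the actual interface introduced by this PR:

# Hypothetical sketch: the strategy interface shown here is an assumption
# made for illustration, not the API defined in this PR.
class SlidingWindowRemovalStrategy:
    def __init__(self, max_memories: int = 7):
        self.max_memories = max_memories

    def select_memories_to_remove(self, memorized_frame_indices: list[int]) -> list[int]:
        # Keep only the most recent `max_memories` memories; return the
        # frame indices whose memory features should be freed from VRAM.
        if len(memorized_frame_indices) <= self.max_memories:
            return []
        ordered = sorted(memorized_frame_indices)
        return ordered[: len(ordered) - self.max_memories]

Whatever the exact interface, the point is that memory eviction becomes a policy the caller controls instead of the predictor accumulating every past frame's memory features.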

More importantly, this makes it possible to run prediction on videos without having to extract the frames to JPEG files beforehand:

import cv2
import torch
from tqdm import tqdm
from sam2.sam2_generic_video_predictor import Prompt
from sam2.build_sam import build_sam2_generic_video_predictor

sam2_checkpoint = "../checkpoints/sam2.1_hiera_base_plus.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_b+.yaml"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

predictor = build_sam2_generic_video_predictor(model_cfg, sam2_checkpoint, device=device)

cap = cv2.VideoCapture("./videos/bedroom.mp4")
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
orig_hw = (height, width)

def read_frame(cap) -> torch.Tensor | None:
    ret, frame = cap.read()
    if not ret:
        return None
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV reads BGR
    frame = torch.as_tensor(frame).permute(2, 0, 1).to(device)  # HWC -> CHW
    frame = frame / 255.0  # normalize to [0, 1]
    return frame
 
# Add a prompt on the first frame:
# one foreground point (label 1) at pixel coordinates (x=400, y=150) for object id 0.
initial_frame = read_frame(cap)
points_coords = torch.tensor([400.0, 150.0], device=device).reshape((1, 1, 2))
points_labels = torch.tensor([1], device=device).reshape((1, 1))
prompt = Prompt(obj_id=0, points_coords=points_coords, points_labels=points_labels)
results = predictor.forward(frame=initial_frame, object_prompts=[prompt])

for f in tqdm(range(1, n_frames)):
    frame = read_frame(cap)

    if frame is None:
        break

    results = predictor.forward(frame=frame)
    
    # Do something with the result, for example:
    #     show_mask((results[0].best_mask_logits > 0), plt.gca(), obj_id=0)

cap.release()

The full usage example is available in the generic_video_predictor_example.ipynb notebook.
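For reference, here is a minimal sketch of post-processing the returned results into a binary mask image. It assumes, as the commented show_mask call above suggests, that results is indexed by object and that best_mask_logits is a tensor covering the original frame; the exact shape and any fields beyond best_mask_logits are assumptions:

import numpy as np
import torch

def mask_to_image(results, obj_idx: int = 0) -> np.ndarray:
    # Threshold the logits at 0 to get a binary mask, then convert it to a
    # uint8 image (0 or 255) that can be saved or overlaid with OpenCV.
    logits = results[obj_idx].best_mask_logits
    mask = (logits > 0).squeeze().to(torch.uint8).cpu().numpy() * 255
    return mask

# e.g. inside the loop above: cv2.imwrite(f"mask_{f:05d}.png", mask_to_image(results))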

@cjaverliat marked this pull request as draft on May 3, 2025, 17:11