Add stream_index seek mode, read frame index and update metadata #764

Open
wants to merge 12 commits into main

Conversation

Dan-Flores
Contributor

This PR works towards enabling decoders to receive an external frame index, as described in #466. It is only enabled in the private C++ code and is not exposed in Python yet.

The changes:

  • The frame index is expected to be passed into add_video_stream as a tuple of 3 tensors.
    • The reason it is not passed into the SingleStreamDecoder constructor (as discussed in Allow receiving an external frame index #466) is that a stream_index needs to be specified to update the decoder's metadata from the ffprobe output, and that stream_index is explicitly available during the add_video_stream call.
  • The metadata updates happen in the added function readFrameIndexUpdateMetadataAndIndex, which parallels the function used by exact mode, scanFileAndUpdateMetadataAndIndex.
    • It assumes PTS ints are passed in, since that is the expected type used in FrameInfo.
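For illustration, here is a minimal Python sketch of that layout, using plain lists as stand-ins for the three tensors (the variable names here are hypothetical, not the actual API):

```python
# Sketch of the frame index passed to add_video_stream: a tuple of three
# same-length sequences (int64 tensors in the real API), one entry per frame.
pts = [0, 512, 1024, 1536]       # presentation timestamps, as ints
is_key_frame = [1, 0, 0, 1]      # whether each frame is a key frame
duration = [512, 512, 512, 512]  # duration of each frame, in PTS units

frame_index = (pts, is_key_frame, duration)

# All three sequences describe the same frames, so their lengths must match.
assert len(pts) == len(is_key_frame) == len(duration)
```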

Testing:

  • To create a frame index in the format needed, a new frame_index property is added to TestContainerFile, which runs, parses, and sorts the ffprobe output.
  • I updated test_get_metadata to allow the new seek mode to work.
  • I added test_seek_mode_frame_index, which tests a variety of frame grabbing functions in the new seek mode.
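As a rough illustration of what that property does, the sketch below parses a hypothetical, heavily truncated ffprobe JSON payload and sorts the frames by PTS; the real field names and the conversion to tensors may differ:

```python
import json

# Hypothetical, truncated ffprobe -show_frames output in JSON form.
ffprobe_output = """
{"frames": [
  {"pts": 512, "key_frame": 0, "duration": 512},
  {"pts": 0,   "key_frame": 1, "duration": 512}
]}
"""

frames = json.loads(ffprobe_output)["frames"]
frames.sort(key=lambda frame: frame["pts"])  # sort by presentation timestamp

pts = [frame["pts"] for frame in frames]
is_key_frame = [frame["key_frame"] for frame in frames]
duration = [frame["duration"] for frame in frames]
```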

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 10, 2025
@@ -82,7 +79,7 @@ def test_get_metadata(metadata_getter):
assert best_video_stream_metadata.begin_stream_seconds_from_header == 0
assert best_video_stream_metadata.bit_rate == 128783
assert best_video_stream_metadata.average_fps == pytest.approx(29.97, abs=0.001)
assert best_video_stream_metadata.pixel_aspect_ratio is None
assert best_video_stream_metadata.pixel_aspect_ratio == Fraction(1, 1)
Contributor Author

Since add_video_stream is now always called, this value is set instead of being None.

@@ -266,15 +269,20 @@ void add_video_stream(
std::optional<int64_t> num_threads = std::nullopt,
std::optional<std::string_view> dimension_order = std::nullopt,
std::optional<int64_t> stream_index = std::nullopt,
std::optional<std::string_view> device = std::nullopt) {
std::optional<std::string_view> device = std::nullopt,
std::optional<std::string_view> color_conversion_library = std::nullopt,
Contributor Author

This is not related to the main change in this PR, but I added the color_conversion_library argument, since it is available in _add_video_stream, but not add_video_stream.

Currently, tests utilize _add_video_stream whenever color_conversion_library is needed. Please let me know if this was intentional, and I can remove this change.

Member

This was intentional :)

We wanted to make the library selection something very, very private, which is why we are only exposing it through _core._add_video_stream(), with two (!) leading underscores.

IIRC, we still had to expose it to Python for testing reasons, but we shouldn't add color_conversion_library to the "still private but slightly more public" _core.add_video_stream() API.

: INT64_MAX;
streamInfos_[streamIndex].allFrames.push_back(frameInfo);
}
readFrameIndex_ = true;
Contributor Author

I'm not sure readFrameIndex_ is necessary, as there should be an error if add_video_stream is called more than once, but I added this check for consistency.

Member

I left a comment on that just below

@Dan-Flores Dan-Flores marked this pull request as ready for review July 10, 2025 04:53
Member

@NicolasHug left a comment

Thanks for the PR @Dan-Flores , this looks great! I took a look at the code, I haven't checked the tests yet.

test/utils.py Outdated
@@ -121,6 +122,7 @@ class TestContainerFile:
default_stream_index: int
stream_infos: Dict[int, Union[TestVideoStreamInfo, TestAudioStreamInfo]]
frames: Dict[int, Dict[int, TestFrameInfo]]
frame_index_data: tuple[torch.Tensor, torch.Tensor, torch.Tensor] | None = None
Member

This line is causing errors on 3.9:

test/utils.py:127: in TestContainerFile
    frame_index_data: tuple[torch.Tensor, torch.Tensor, torch.Tensor] | None = None
E   TypeError: unsupported operand type(s) for |: 'types.GenericAlias' and 'NoneType'

I think we'd have to do from __future__ import annotations for | to be a valid 3.9 typing operator. But in general we've just been using Optional when something can be None.

Personally, I prefer the way you've done it here. Optional is IMO a bad name, because it conflicts with the notion of "optional parameter", i.e. a parameter which can have a default. That's very different from a parameter that can be None, which is what Optional defines. In any case, we should probably just keep using Optional here, and move towards Union or | everywhere in the code base at some point.
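A minimal sketch of the 3.9-compatible spelling (with ints standing in for tensors; the field name mirrors the one in the diff):

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Example:
    # Works on 3.9: Optional[X] means "X or None".
    frame_index_data: Optional[Tuple[int, int, int]] = None
    # The `tuple[...] | None` spelling only works on 3.9 with
    # `from __future__ import annotations` at the top of the module.
```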

@@ -319,6 +319,45 @@ void SingleStreamDecoder::scanFileAndUpdateMetadataAndIndex() {
scannedAllStreams_ = true;
}

void SingleStreamDecoder::readFrameIndexUpdateMetadataAndIndex(
int streamIndex,
std::tuple<at::Tensor, at::Tensor, at::Tensor> frameIndex) {
Member

Nit: we're currently using both at::Tensor and torch::Tensor throughout the codebase; we should probably align on a single one. torch::Tensor seems to be used more often:

$ git grep torch::Tensor | wc -l
82
$ git grep at::Tensor | wc -l
38

Maybe just open an issue to follow-up on this?

Contributor Author

I opened #767 to follow up on this change separately.

Comment on lines 325 to 327
if (readFrameIndex_) {
return;
}
Member

Calling readFrameIndexUpdateMetadataAndIndex twice with different frameIndex values might lead to unexpected results, because the frame index won't be updated on the second call.

As you mentioned in your comment, our code never actually calls readFrameIndexUpdateMetadataAndIndex twice, so this sequence of calls shouldn't occur in practice. However, for someone new to the codebase, this early return might be surprising. To address this, I suggest either:

  • Removing the readFrameIndex_ logic and allowing readFrameIndexUpdateMetadataAndIndex to be called multiple times, updating the index each time.
  • Raising a clear error if it's called more than once, instead of exiting early.
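The second option could look like this sketch (a Python stand-in for the C++ guard; class and method names are illustrative):

```python
class DecoderSketch:
    def __init__(self):
        self._frame_index_read = False

    def read_frame_index_update_metadata_and_index(self, frame_index):
        # Fail loudly instead of silently ignoring the second call.
        if self._frame_index_read:
            raise RuntimeError(
                "The frame index was already read; it cannot be updated twice."
            )
        self._frame_index_read = True
        # ... update metadata and the frame index here ...
```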


auto last_idx = all_frames.size(0) - 1;
streamMetadata.beginStreamPtsFromContent = all_frames[0].item<int64_t>();
streamMetadata.endStreamPtsFromContent =
all_frames[last_idx].item<int64_t>() + duration[last_idx].item<int64_t>();
Member

I would assume -1 would work like it does in Python, but I'm not 100% sure:

Suggested change
all_frames[last_idx].item<int64_t>() + duration[last_idx].item<int64_t>();
all_frames[-1].item<int64_t>() + duration[-1].item<int64_t>();


auto& streamMetadata = containerMetadata_.allStreamMetadata[streamIndex];

// Get the last index for key_frames and duration
Member

Nit: the comment mentions key_frames but we're actually using all_frames.

Also, we should TORCH_CHECK here that all 3 tensors are of the same length!
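The length check could look like this (a Python stand-in for the suggested TORCH_CHECK; the function name is illustrative):

```python
def check_frame_index_lengths(all_frames, is_key_frame, duration):
    # All three inputs describe the same frames, so their lengths must match.
    if not (len(all_frames) == len(is_key_frame) == len(duration)):
        raise ValueError(
            "Expected all frame index tensors to have the same length, got "
            f"{len(all_frames)}, {len(is_key_frame)}, {len(duration)}"
        )
```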

// FrameInfo struct utilizes PTS
FrameInfo frameInfo = {all_frames[i].item<int64_t>()};
frameInfo.isKeyFrame =
(i < key_frames.size(0) && key_frames[i].item<int64_t>() == 1);
Member

The i < key_frames.size(0) check shouldn't be needed if we do the TORCH_CHECK validation mentioned above.
Also, I think key_frames should be a bool tensor, and I'd suggest to rename it to is_key_frame

// endStreamPtsSecondsFromContent
void readFrameIndexUpdateMetadataAndIndex(
int streamIndex,
std::tuple<at::Tensor, at::Tensor, at::Tensor> frameIndex);
Member

Here or somewhere else, we should document the expected length, dtype, and associated semantics of each tensor.

We could also consider making FrameIndex a struct, instead of relying on a tuple.


@@ -29,7 +29,7 @@ class SingleStreamDecoder {
// CONSTRUCTION API
// --------------------------------------------------------------------------

enum class SeekMode { exact, approximate };
enum class SeekMode { exact, approximate, custom_frame_mappings };
Member

I don't know if we'll want to publicly expose a new seek_mode in Python, but I think for the C++ side this is a reasonable approach. @scotts curious if you have any opinion on this?

Contributor

Agreed. Right now, the SingleStreamDecoder's seekMode_ is what we use to determine if we should do a file scan or not:

if (seekMode_ == SeekMode::exact) {
scanFileAndUpdateMetadataAndIndex();
}

Since the point of this feature is to avoid a file scan while still maintaining seek accuracy, I think it makes sense for it to be a new seek mode on the C++ side. We can figure out what to do with the public API later.

test/test_ops.py Outdated
@@ -448,6 +448,61 @@ def test_frame_pts_equality(self):
)
assert pts_is_equal

def test_seek_mode_custom_frame_mappings_fails(self):
decoder = create_from_file(str(NASA_VIDEO.path), "custom_frame_mappings")
Member

Nit: here and in other places, use keyword arguments for extra clarity. We typically don't use kw for the path though (1st param), as it's very obvious what it is.

Suggested change
decoder = create_from_file(str(NASA_VIDEO.path), "custom_frame_mappings")
decoder = create_from_file(str(NASA_VIDEO.path), seek_mode="custom_frame_mappings")

test/utils.py Outdated
stream_index = self.default_stream_index
if self.custom_frame_mappings_data is None:
self.get_custom_frame_mappings(stream_index)
return self.custom_frame_mappings_data
Member

I could be wrong but I think there's a potential bug here: if custom_frame_mappings() is called twice e.g.

custom_frame_mappings(stream_index=0)
custom_frame_mappings(stream_index=1)

then the second call will still return the mapping of stream_index=0, because that's the one that was cached?

Perhaps the easiest way to fix this would be to make custom_frame_mappings_data a dict, where the keys are the stream indices and the values are the individual mappings. This is also how the other fields (frames, stream_infos, etc.) are implemented.


In any case, good job on the implementation overall, I like that we compute the mappings on the fly and cache them. This avoids us having to check those long files into the repo.
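The suggested per-stream cache could be sketched like this (illustrative names; a dict replaces the single cached tuple):

```python
class MappingsCache:
    """Sketch of the suggested fix: one cached mapping per stream index."""

    def __init__(self):
        self._custom_frame_mappings_data = {}  # stream_index -> mappings
        self.compute_calls = 0

    def get_custom_frame_mappings(self, stream_index):
        # Compute and cache per stream, so different indices never collide.
        if stream_index not in self._custom_frame_mappings_data:
            self._custom_frame_mappings_data[stream_index] = self._compute(
                stream_index
            )
        return self._custom_frame_mappings_data[stream_index]

    def _compute(self, stream_index):
        # Stand-in for running and parsing ffprobe for this stream.
        self.compute_calls += 1
        return ([0, 512], [1, 0], [512, 512])
```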

Contributor Author

@Dan-Flores Jul 10, 2025

Good catch, this would cause a bug if repeatedly called. I'll update custom_frame_mappings_data to be a dict, and remove the @property decorator so that providing a stream_index is possible.

test/utils.py Outdated
custom_frame_mappings_data[1].append(frame["key_frame"])
custom_frame_mappings_data[2].append(float(frame["duration"]))

(pts_list, key_frame_list, duration_list) = custom_frame_mappings_data
Member

Instead of defining custom_frame_mappings_data, let's just define the 3 individual lists? i.e.

pts_list = []
key_frame_list = []
duration_list = []
for frame in frames: ...

or even just create all 3 by list comprehension. It's 3X slower but I don't think we care.
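For instance, the comprehension version could look like this (illustrative data standing in for the parsed ffprobe frames):

```python
# Hypothetical parsed ffprobe frames.
frames = [
    {"pts": 0, "key_frame": 1, "duration": 512},
    {"pts": 512, "key_frame": 0, "duration": 512},
]

# Three individual lists via list comprehensions, as suggested:
pts_list = [frame["pts"] for frame in frames]
key_frame_list = [frame["key_frame"] for frame in frames]
duration_list = [float(frame["duration"]) for frame in frames]
```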

Member

@NicolasHug left a comment

Thanks for addressing my comments @Dan-Flores , I made another pass! I think there's another design discussion we need to have, about whether we expect the mappings to be sorted.

@@ -121,6 +122,9 @@ class TestContainerFile:
default_stream_index: int
stream_infos: Dict[int, Union[TestVideoStreamInfo, TestAudioStreamInfo]]
frames: Dict[int, Dict[int, TestFrameInfo]]
custom_frame_mappings_data: Dict[
Member

Nit: since we should only access the mappings through the get_custom_frame_mappings() function, I'd suggest "marking" that field as private:

Suggested change
custom_frame_mappings_data: Dict[
_custom_frame_mappings_data: Dict[

void readCustomFrameMappingsUpdateMetadataAndIndex(
int streamIndex,
std::tuple<at::Tensor, at::Tensor, at::Tensor> customFrameMappings);

Member

Just bumping #764 (comment) again, which may have been missed:

  • we should document the expected length, dtype, and associated semantics of each tensor.
  • I'd also recommend relying on a new FrameMappings struct instead of a 3-tuple.

Contributor

To elaborate on defining a new FrameMappings struct: I think of the code in custom_ops.cpp as the bridge layer between the C++ logic and the Python logic. So that's the layer where we would turn a tuple of tensors into a struct. That way, in the C++ logic, we're (as much as possible) operating on proper types with meaningful field names.

As a point of comparison, we do something similar with batched output. SingleStreamDecoder returns a FrameBatchOutput struct, and the code in custom_ops.cpp turns that into a tuple of tensors.

streamMetadata.numFramesFromContent = all_frames.size(0);
for (int64_t i = 0; i < all_frames.size(0); ++i) {
// FrameInfo struct utilizes PTS
FrameInfo frameInfo = {all_frames[i].item<int64_t>()};
Member

I'm not very familiar with that kind of initialization, but it's not immediately obvious that it's setting the frameInfo.pts field. Let's set the field explicitly instead, for clarity.

frameInfo.nextPts = (i + 1 < all_frames.size(0))
? all_frames[i + 1].item<int64_t>()
: INT64_MAX;
streamInfos_[streamIndex].allFrames.push_back(frameInfo);
Member

We should also make sure to update the streamInfos_[streamIndex].keyFrames index, if isKeyFrame is true!

frameInfo.isKeyFrame = (is_key_frame[i].item<bool>() == true);
frameInfo.nextPts = (i + 1 < all_frames.size(0))
? all_frames[i + 1].item<int64_t>()
: INT64_MAX;
Member

Let's also set the frameInfo's frameIndex field. I think it should be i?
EDIT: it should actually be set after we are sure the sequence is sorted, see other comment below.

streamInfos_[streamIndex].allFrames.push_back(frameInfo);
}
}

Member

Reading this, I realize there is an additional design decision we'll have to make: whether or not we expect the index to already be sorted.

In the existing scanFileAndUpdateMetadataAndIndex() function we are reading packets in order, and yet we are sorting them afterwards:

// Sort all frames by their pts.
for (auto& [streamIndex, streamInfo] : streamInfos_) {
std::sort(
streamInfo.keyFrames.begin(),
streamInfo.keyFrames.end(),
[](const FrameInfo& frameInfo1, const FrameInfo& frameInfo2) {
return frameInfo1.pts < frameInfo2.pts;
});
std::sort(
streamInfo.allFrames.begin(),
streamInfo.allFrames.end(),
[](const FrameInfo& frameInfo1, const FrameInfo& frameInfo2) {
return frameInfo1.pts < frameInfo2.pts;
});

I suspect that frame mappings coming from ffprobe won't be ordered in general. I think we have the following options:

  • expect the input mapping to be sorted - that may not be a great UX
  • sort the mapping in Python - this is kinda what this PR is doing, by sorting the mappings in the tests - but we should remove that and have that logic within the code, not the tests
  • sort the mappings in C++

I think the simpler, safer choice is to sort in C++ and rely on the same sorting logic that we have in scanFileAndUpdateMetadataAndIndex(). Curious what your thoughts are @Dan-Flores @scotts ?

Contributor

Agreed on sorting on the C++ side; we can do it quickly there, and it's much nicer to users. We should extract the existing logic out to a utility function called only in this cpp file, and call it in both places.

I think we sort in the scan function because it's possible for the actual packets not to be in PTS order.
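The shared utility could be sketched like this (a Python stand-in, with dicts standing in for FrameInfo records; names are illustrative):

```python
def sort_frames_by_pts(frame_infos):
    # Shared utility: sort FrameInfo-like records by pts, as the scan
    # function already does for both keyFrames and allFrames.
    frame_infos.sort(key=lambda info: info["pts"])
    # frameIndex is only meaningful once the sequence is sorted.
    for i, info in enumerate(frame_infos):
        info["frame_index"] = i
    return frame_infos
```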

metadata_getter.keywords["seek_mode"] == "exact"
if isinstance(metadata_getter, functools.partial)
else False
@pytest.mark.parametrize("seek_mode", ["approximate", "exact", "custom_frame_mappings"])
Member

I think you may have removed get_container_metadata_from_header from the previous parametrization



streamMetadata.beginStreamPtsFromContent = all_frames[0].item<int64_t>();
streamMetadata.endStreamPtsFromContent =
all_frames[-1].item<int64_t>() + duration[-1].item<int64_t>();
Contributor

TIL that tensors on the C++ side also support negative indices! I'm so used to that being not allowed in C++ arrays and standard containers that I initially thought this was undefined behavior!


streamMetadata.endStreamPtsSecondsFromContent =
*streamMetadata.endStreamPtsFromContent * av_q2d(avStream->time_base);

Contributor

We should probably use the ptsToSeconds() function here, but it looks like we're also not using in the scanning function. And we can probably better define ptsToSeconds() to use av_q2d(). Let's address that elsewhere; created #770 to track.
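For reference, the conversion itself is just a multiplication by the stream's time base (illustrative Python; `Fraction` stands in for FFmpeg's AVRational):

```python
from fractions import Fraction

def pts_to_seconds(pts, time_base):
    # Equivalent to pts * av_q2d(time_base) on the C++ side.
    return pts * time_base.numerator / time_base.denominator

# With a 1/30000 time base, a PTS of 90000 corresponds to 3.0 seconds.
seconds = pts_to_seconds(90000, Fraction(1, 30000))
```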
