Add stream_index seek mode, read frame index and update metadata #764

Open
wants to merge 12 commits into main

Conversation

Dan-Flores
Contributor

This PR works towards enabling decoders to receive an external frame index, as described in #466. It is only enabled in the private C++ code and is not exposed in Python yet.

The changes:

  • The frame index is expected to be passed into add_video_stream as a tuple of 3 tensors.
    • The reason it is not passed into the SingleStreamDecoder constructor (as discussed in Allow receiving an external frame index #466) is that a stream_index needs to be specified to update the decoder's metadata from the ffprobe output, and that stream_index is explicitly available during the add_video_stream call.
  • The metadata updates happen in the added function readFrameIndexUpdateMetadataAndIndex, which parallels the function used by exact mode, scanFileAndUpdateMetadataAndIndex.
    • It assumes PTS ints are passed in, since that is the expected type used in FrameInfo.
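For illustration, here is a minimal Python sketch of that layout, using plain lists as stand-ins for the three tensors (the variable names here are hypothetical, not the actual API):

```python
# Sketch of the frame index passed to add_video_stream: a tuple of three
# same-length sequences (int64 tensors in the real API), one entry per frame.
pts = [0, 512, 1024, 1536]       # presentation timestamps, as ints
is_key_frame = [1, 0, 0, 1]      # whether each frame is a key frame
duration = [512, 512, 512, 512]  # duration of each frame, in PTS units

frame_index = (pts, is_key_frame, duration)

# All three sequences describe the same frames, so their lengths must match.
assert len(pts) == len(is_key_frame) == len(duration)
```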

Testing:

  • To create a frame index in the format needed, a new frame_index property is added to TestContainerFile, which runs, parses, and sorts the ffprobe output.
  • I updated test_get_metadata to allow the new seek mode to work.
  • I added test_seek_mode_frame_index, which tests a variety of frame grabbing functions in the new seek mode.
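As a rough illustration of what that property does, the sketch below parses a hypothetical, heavily truncated ffprobe JSON payload and sorts the frames by PTS; the real field names and the conversion to tensors may differ:

```python
import json

# Hypothetical, truncated ffprobe -show_frames output in JSON form.
ffprobe_output = """
{"frames": [
  {"pts": 512, "key_frame": 0, "duration": 512},
  {"pts": 0,   "key_frame": 1, "duration": 512}
]}
"""

frames = json.loads(ffprobe_output)["frames"]
frames.sort(key=lambda frame: frame["pts"])  # sort by presentation timestamp

pts = [frame["pts"] for frame in frames]
is_key_frame = [frame["key_frame"] for frame in frames]
duration = [frame["duration"] for frame in frames]
```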

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 10, 2025
@@ -82,7 +79,7 @@ def test_get_metadata(metadata_getter):
assert best_video_stream_metadata.begin_stream_seconds_from_header == 0
assert best_video_stream_metadata.bit_rate == 128783
assert best_video_stream_metadata.average_fps == pytest.approx(29.97, abs=0.001)
assert best_video_stream_metadata.pixel_aspect_ratio is None
assert best_video_stream_metadata.pixel_aspect_ratio == Fraction(1, 1)
Contributor Author

Since add_video_stream is now always called, this value is set instead of being None.

@@ -266,15 +269,20 @@ void add_video_stream(
std::optional<int64_t> num_threads = std::nullopt,
std::optional<std::string_view> dimension_order = std::nullopt,
std::optional<int64_t> stream_index = std::nullopt,
std::optional<std::string_view> device = std::nullopt) {
std::optional<std::string_view> device = std::nullopt,
std::optional<std::string_view> color_conversion_library = std::nullopt,
Contributor Author

This is not related to the main change in this PR, but I added the color_conversion_library argument, since it is available in _add_video_stream, but not add_video_stream.

Currently, tests utilize _add_video_stream whenever color_conversion_library is needed. Please let me know if this was intentional, and I can remove this change.

Member

This was intentional :)

We wanted to make the library selection something very, very private, which is why we are only exposing it through _core._add_video_stream(), with two (!) leading underscores.

IIRC, we still had to expose it to Python for testing reasons, but we shouldn't add color_conversion_library to the "still private but slightly more public" _core.add_video_stream() API.

: INT64_MAX;
streamInfos_[streamIndex].allFrames.push_back(frameInfo);
}
readFrameIndex_ = true;
Contributor Author

I'm not sure readFrameIndex_ is necessary, as there should be an error if add_video_stream is called more than once, but I added this check for consistency.

Member

I left a comment on that just below

@Dan-Flores Dan-Flores marked this pull request as ready for review July 10, 2025 04:53
Member

@NicolasHug left a comment

Thanks for the PR @Dan-Flores , this looks great! I took a look at the code, I haven't checked the tests yet.

test/utils.py Outdated
@@ -121,6 +122,7 @@ class TestContainerFile:
default_stream_index: int
stream_infos: Dict[int, Union[TestVideoStreamInfo, TestAudioStreamInfo]]
frames: Dict[int, Dict[int, TestFrameInfo]]
frame_index_data: tuple[torch.Tensor, torch.Tensor, torch.Tensor] | None = None
Member

This line is causing errors on 3.9:

test/utils.py:127: in TestContainerFile
    frame_index_data: tuple[torch.Tensor, torch.Tensor, torch.Tensor] | None = None
E   TypeError: unsupported operand type(s) for |: 'types.GenericAlias' and 'NoneType'

I think we'd have to do from __future__ import annotations for | to be a valid 3.9 typing operator. But in general we've just been using Optional when something can be None.

Personally, I prefer the way you've done it here. Optional is IMO a bad name, because it conflicts with the notion of "optional parameter", i.e. a parameter which can have a default. That's very different from a parameter that can be None, which is what Optional defines. In any case, we should probably just keep using Optional here, and move towards Union or | everywhere in the code base at some point.
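A minimal sketch of the 3.9-compatible spelling (with ints standing in for tensors; the field name mirrors the one in the diff):

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Example:
    # Works on 3.9: Optional[X] means "X or None".
    frame_index_data: Optional[Tuple[int, int, int]] = None
    # The `tuple[...] | None` spelling only works on 3.9 with
    # `from __future__ import annotations` at the top of the module.
```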

@@ -319,6 +319,45 @@ void SingleStreamDecoder::scanFileAndUpdateMetadataAndIndex() {
scannedAllStreams_ = true;
}

void SingleStreamDecoder::readFrameIndexUpdateMetadataAndIndex(
int streamIndex,
std::tuple<at::Tensor, at::Tensor, at::Tensor> frameIndex) {
Member

Nit: we're currently using both at::Tensor and torch::Tensor throughout the codebase; we should probably align on a single one. torch::Tensor seems to be used more often:

$ git grep torch::Tensor | wc -l
82
$ git grep at::Tensor | wc -l
38

Maybe just open an issue to follow-up on this?

Contributor Author

I opened #767 to follow up on this change separately.

Comment on lines 325 to 327
if (readFrameIndex_) {
return;
}
Member

Calling readFrameIndexUpdateMetadataAndIndex twice with different frameIndex values might lead to unexpected results, because the frame index won't be updated on the second call.

As you mentioned in your comment, our code never actually calls readFrameIndexUpdateMetadataAndIndex twice, so this sequence of calls shouldn't occur in practice. However, for someone new to the codebase, this early return might be surprising. To address this, I suggest either:

  • Removing the readFrameIndex_ logic and allowing readFrameIndexUpdateMetadataAndIndex to be called multiple times, updating the index each time.
  • Raising a clear error if it's called more than once, instead of exiting early.
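The second option could look like this sketch (a Python stand-in for the C++ guard; class and method names are illustrative):

```python
class DecoderSketch:
    def __init__(self):
        self._frame_index_read = False

    def read_frame_index_update_metadata_and_index(self, frame_index):
        # Fail loudly instead of silently ignoring the second call.
        if self._frame_index_read:
            raise RuntimeError(
                "The frame index was already read; it cannot be updated twice."
            )
        self._frame_index_read = True
        # ... update metadata and the frame index here ...
```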


auto last_idx = all_frames.size(0) - 1;
streamMetadata.beginStreamPtsFromContent = all_frames[0].item<int64_t>();
streamMetadata.endStreamPtsFromContent =
all_frames[last_idx].item<int64_t>() + duration[last_idx].item<int64_t>();
Member

I would assume -1 would work like it does in Python, but I'm not 100% sure:

Suggested change
all_frames[last_idx].item<int64_t>() + duration[last_idx].item<int64_t>();
all_frames[-1].item<int64_t>() + duration[-1].item<int64_t>();


auto& streamMetadata = containerMetadata_.allStreamMetadata[streamIndex];

// Get the last index for key_frames and duration
Member

Nit: the comment mentions key_frames but we're actually using all_frames.

Also, we should TORCH_CHECK here that all 3 tensors are of the same length!
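The length check could look like this (a Python stand-in for the suggested TORCH_CHECK; the function name is illustrative):

```python
def check_frame_index_lengths(all_frames, is_key_frame, duration):
    # All three inputs describe the same frames, so their lengths must match.
    if not (len(all_frames) == len(is_key_frame) == len(duration)):
        raise ValueError(
            "Expected all frame index tensors to have the same length, got "
            f"{len(all_frames)}, {len(is_key_frame)}, {len(duration)}"
        )
```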

// FrameInfo struct utilizes PTS
FrameInfo frameInfo = {all_frames[i].item<int64_t>()};
frameInfo.isKeyFrame =
(i < key_frames.size(0) && key_frames[i].item<int64_t>() == 1);
Member

The i < key_frames.size(0) check shouldn't be needed if we do the TORCH_CHECK validation mentioned above.
Also, I think key_frames should be a bool tensor, and I'd suggest to rename it to is_key_frame

// endStreamPtsSecondsFromContent
void readFrameIndexUpdateMetadataAndIndex(
int streamIndex,
std::tuple<at::Tensor, at::Tensor, at::Tensor> frameIndex);
Member

Here or somewhere else, we should document the expected length, dtype, and associated semantics of each tensor.

We could also consider making FrameIndex a struct, instead of relying on a tuple.


@@ -29,7 +29,7 @@ class SingleStreamDecoder {
// CONSTRUCTION API
// --------------------------------------------------------------------------

enum class SeekMode { exact, approximate };
enum class SeekMode { exact, approximate, custom_frame_mappings };
Member

I don't know if we'll want to publicly expose a new seek_mode in Python, but I think for the C++ side this is a reasonable approach. @scotts curious if you have any opinion on this?

Contributor

Agreed. Right now, the SingleStreamDecoder's seekMode_ is what we use to determine if we should do a file scan or not:

if (seekMode_ == SeekMode::exact) {
scanFileAndUpdateMetadataAndIndex();
}

Since the point of this feature is to avoid a file scan while still maintaining seek accuracy, I think it makes sense for it to be a new seek mode on the C++ side. We can figure out what to do with the public API later.

test/test_ops.py Outdated
@@ -448,6 +448,61 @@ def test_frame_pts_equality(self):
)
assert pts_is_equal

def test_seek_mode_custom_frame_mappings_fails(self):
decoder = create_from_file(str(NASA_VIDEO.path), "custom_frame_mappings")
Member

Nit: here and in other places, use keyword arguments for extra clarity. We typically don't use kw for the path though (1st param), as it's very obvious what it is.

Suggested change
decoder = create_from_file(str(NASA_VIDEO.path), "custom_frame_mappings")
decoder = create_from_file(str(NASA_VIDEO.path), seek_mode="custom_frame_mappings")

test/utils.py Outdated
stream_index = self.default_stream_index
if self.custom_frame_mappings_data is None:
self.get_custom_frame_mappings(stream_index)
return self.custom_frame_mappings_data
Member

I could be wrong but I think there's a potential bug here: if custom_frame_mappings() is called twice e.g.

custom_frame_mappings(stream_index=0)
custom_frame_mappings(stream_index=1)

then the second call will still return the mapping of stream_index=0, because that's the one that was cached?

Perhaps the easiest way to fix this would be to make custom_frame_mappings_data a dict, where the keys are the stream indices and the values are the individual mappings. This is also how the other fields (frames, stream_infos, etc.) are implemented.


In any case, good job on the implementation overall, I like that we compute the mappings on the fly and cache them. This avoids us having to check those long files into the repo.
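The suggested per-stream cache could be sketched like this (illustrative names; a dict replaces the single cached tuple):

```python
class MappingsCache:
    """Sketch of the suggested fix: one cached mapping per stream index."""

    def __init__(self):
        self._custom_frame_mappings_data = {}  # stream_index -> mappings
        self.compute_calls = 0

    def get_custom_frame_mappings(self, stream_index):
        # Compute and cache per stream, so different indices never collide.
        if stream_index not in self._custom_frame_mappings_data:
            self._custom_frame_mappings_data[stream_index] = self._compute(
                stream_index
            )
        return self._custom_frame_mappings_data[stream_index]

    def _compute(self, stream_index):
        # Stand-in for running and parsing ffprobe for this stream.
        self.compute_calls += 1
        return ([0, 512], [1, 0], [512, 512])
```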

Contributor Author

@Dan-Flores Jul 10, 2025

Good catch, this would cause a bug if repeatedly called. I'll update custom_frame_mappings_data to be a dict, and remove the @property decorator so that providing a stream_index is possible.

test/utils.py Outdated
custom_frame_mappings_data[1].append(frame["key_frame"])
custom_frame_mappings_data[2].append(float(frame["duration"]))

(pts_list, key_frame_list, duration_list) = custom_frame_mappings_data
Member

Instead of defining custom_frame_mappings_data, let's just define the 3 individual lists? i.e.

pts_list = []
key_frame_list = []
duration_list = []
for frame in frames: ...

or even just create all 3 by list comprehension. It's 3X slower but I don't think we care.
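For instance, the comprehension version could look like this (illustrative data standing in for the parsed ffprobe frames):

```python
# Hypothetical parsed ffprobe frames.
frames = [
    {"pts": 0, "key_frame": 1, "duration": 512},
    {"pts": 512, "key_frame": 0, "duration": 512},
]

# Three individual lists via list comprehensions, as suggested:
pts_list = [frame["pts"] for frame in frames]
key_frame_list = [frame["key_frame"] for frame in frames]
duration_list = [float(frame["duration"]) for frame in frames]
```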

Member

@NicolasHug left a comment

Thanks for addressing my comments @Dan-Flores , I made another pass! I think there's another design discussion we need to have, about whether we expect the mappings to be sorted.

@@ -121,6 +122,9 @@ class TestContainerFile:
default_stream_index: int
stream_infos: Dict[int, Union[TestVideoStreamInfo, TestAudioStreamInfo]]
frames: Dict[int, Dict[int, TestFrameInfo]]
custom_frame_mappings_data: Dict[
Member

Nit: since we should only access the mappings through the get_custom_frame_mappings() function, I'd suggest "marking" that field as private:

Suggested change
custom_frame_mappings_data: Dict[
_custom_frame_mappings_data: Dict[

void readCustomFrameMappingsUpdateMetadataAndIndex(
int streamIndex,
std::tuple<at::Tensor, at::Tensor, at::Tensor> customFrameMappings);

Member

Just bumping #764 (comment) again, which may have been missed:

  • we should document the expected length, dtype, and associated semantics of each tensor.
  • I'd also recommend relying on a new FrameMappings struct instead of a 3-tuple.

Contributor

To elaborate on defining a new FrameMappings struct: I think of the code in custom_ops.cpp as the bridge layer between the C++ logic and the Python logic. So that's the layer where we would turn a tuple of tensors into a struct. That way, in the C++ logic, we're (as much as possible) operating on proper types with meaningful field names.

As a point of comparison, we do something similar with batched output. SingleStreamDecoder returns a FrameBatchOutput struct, and the code in custom_ops.cpp turns that into a tuple of tensors.

streamMetadata.numFramesFromContent = all_frames.size(0);
for (int64_t i = 0; i < all_frames.size(0); ++i) {
// FrameInfo struct utilizes PTS
FrameInfo frameInfo = {all_frames[i].item<int64_t>()};
Member

I'm not very familiar with that kind of initialization, but it's not immediately obvious that it's setting the frameInfo.pts field. Let's set the field explicitly instead, for clarity.

frameInfo.nextPts = (i + 1 < all_frames.size(0))
? all_frames[i + 1].item<int64_t>()
: INT64_MAX;
streamInfos_[streamIndex].allFrames.push_back(frameInfo);
Member

We should also make sure to update the streamInfos_[streamIndex].keyFrames index, if isKeyFrame is true!

frameInfo.isKeyFrame = (is_key_frame[i].item<bool>() == true);
frameInfo.nextPts = (i + 1 < all_frames.size(0))
? all_frames[i + 1].item<int64_t>()
: INT64_MAX;
Member

Let's also set the frameInfo's frameIndex field. I think it should be i?
EDIT: it should actually be set after we are sure the sequence is sorted, see other comment below.

streamInfos_[streamIndex].allFrames.push_back(frameInfo);
}
}

Member

Reading this, I realize there is an additional design decision we'll have to make: whether or not we expect the index to already be sorted.

In the existing scanFileAndUpdateMetadataAndIndex() function we are reading packets in order, and yet we are sorting them afterwards:

// Sort all frames by their pts.
for (auto& [streamIndex, streamInfo] : streamInfos_) {
std::sort(
streamInfo.keyFrames.begin(),
streamInfo.keyFrames.end(),
[](const FrameInfo& frameInfo1, const FrameInfo& frameInfo2) {
return frameInfo1.pts < frameInfo2.pts;
});
std::sort(
streamInfo.allFrames.begin(),
streamInfo.allFrames.end(),
[](const FrameInfo& frameInfo1, const FrameInfo& frameInfo2) {
return frameInfo1.pts < frameInfo2.pts;
});

I suspect that frame mappings coming from ffprobe won't be ordered in general. I think we have the following options:

  • expect the input mapping to be sorted - that may not be a great UX
  • sort the mapping in Python - this is kinda what this PR is doing, by sorting the mappings in the tests - but we should remove that and have that logic within the code, not the tests
  • sort the mappings in C++

I think the simpler, safer choice is to sort in C++ and rely on the same sorting logic that we have in scanFileAndUpdateMetadataAndIndex(). Curious what your thoughts are @Dan-Flores @scotts ?

Contributor

Agreed on sorting on the C++ side; we can do it quickly there, and it's much nicer to users. We should extract the existing logic out to a utility function called only in this cpp file, and call it in both places.

I think we sort in the scan function because it's possible for the actual packets not to be in PTS order.
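The shared utility could be sketched like this (a Python stand-in, with dicts standing in for FrameInfo records; names are illustrative):

```python
def sort_frames_by_pts(frame_infos):
    # Shared utility: sort FrameInfo-like records by pts, as the scan
    # function already does for both keyFrames and allFrames.
    frame_infos.sort(key=lambda info: info["pts"])
    # frameIndex is only meaningful once the sequence is sorted.
    for i, info in enumerate(frame_infos):
        info["frame_index"] = i
    return frame_infos
```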

metadata_getter.keywords["seek_mode"] == "exact"
if isinstance(metadata_getter, functools.partial)
else False
@pytest.mark.parametrize("seek_mode", ["approximate", "exact", "custom_frame_mappings"])
Member

I think you may have removed get_container_metadata_from_header from the previous parametrization



streamMetadata.beginStreamPtsFromContent = all_frames[0].item<int64_t>();
streamMetadata.endStreamPtsFromContent =
all_frames[-1].item<int64_t>() + duration[-1].item<int64_t>();
Contributor

TIL that tensors on the C++ side also support negative indices! I'm so used to that being not allowed in C++ arrays and standard containers that I initially thought this was undefined behavior!


streamMetadata.endStreamPtsSecondsFromContent =
*streamMetadata.endStreamPtsFromContent * av_q2d(avStream->time_base);

Contributor

We should probably use the ptsToSeconds() function here, but it looks like we're also not using in the scanning function. And we can probably better define ptsToSeconds() to use av_q2d(). Let's address that elsewhere; created #770 to track.
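For reference, the conversion itself is just a multiplication by the stream's time base (illustrative Python; `Fraction` stands in for FFmpeg's AVRational):

```python
from fractions import Fraction

def pts_to_seconds(pts, time_base):
    # Equivalent to pts * av_q2d(time_base) on the C++ side.
    return pts * time_base.numerator / time_base.denominator

# With a 1/30000 time base, a PTS of 90000 corresponds to 3.0 seconds.
seconds = pts_to_seconds(90000, Fraction(1, 30000))
```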
