Add stream_index seek mode, read frame index and update metadata #764
```diff
@@ -82,7 +79,7 @@ def test_get_metadata(metadata_getter):
     assert best_video_stream_metadata.begin_stream_seconds_from_header == 0
     assert best_video_stream_metadata.bit_rate == 128783
     assert best_video_stream_metadata.average_fps == pytest.approx(29.97, abs=0.001)
-    assert best_video_stream_metadata.pixel_aspect_ratio is None
+    assert best_video_stream_metadata.pixel_aspect_ratio == Fraction(1, 1)
```
Since `add_video_stream` is always being called, this value is set instead of `None`.
src/torchcodec/_core/custom_ops.cpp
```diff
@@ -266,15 +269,20 @@ void add_video_stream(
     std::optional<int64_t> num_threads = std::nullopt,
     std::optional<std::string_view> dimension_order = std::nullopt,
     std::optional<int64_t> stream_index = std::nullopt,
-    std::optional<std::string_view> device = std::nullopt) {
+    std::optional<std::string_view> device = std::nullopt,
+    std::optional<std::string_view> color_conversion_library = std::nullopt,
```
This is not related to the main change in this PR, but I added the `color_conversion_library` argument, since it is available in `_add_video_stream`, but not `add_video_stream`.

Currently, tests utilize `_add_video_stream` whenever `color_conversion_library` is needed. Please let me know if this was intentional, and I can remove this change.
This was intentional :)

We wanted to make the library selection something very, very private, which is why we are only exposing it through `_core._add_video_stream()`, with two (!) leading underscores. IIRC, we still had to expose it to Python for testing reasons, but we shouldn't add `color_conversion_library` to the "still private but slightly more public" `_core.add_video_stream()` API.
```cpp
        : INT64_MAX;
    streamInfos_[streamIndex].allFrames.push_back(frameInfo);
  }
  readFrameIndex_ = true;
```
I'm not sure `readFrameIndex_` is necessary, as there should be an error if `add_video_stream` is called more than once, but I added this check for consistency.
I left a comment on that just below
Thanks for the PR @Dan-Flores , this looks great! I took a look at the code, I haven't checked the tests yet.
test/utils.py
```diff
@@ -121,6 +122,7 @@ class TestContainerFile:
     default_stream_index: int
     stream_infos: Dict[int, Union[TestVideoStreamInfo, TestAudioStreamInfo]]
     frames: Dict[int, Dict[int, TestFrameInfo]]
+    frame_index_data: tuple[torch.Tensor, torch.Tensor, torch.Tensor] | None = None
```
This line is causing errors on 3.9:

```
test/utils.py:127: in TestContainerFile
    frame_index_data: tuple[torch.Tensor, torch.Tensor, torch.Tensor] | None = None
E   TypeError: unsupported operand type(s) for |: 'types.GenericAlias' and 'NoneType'
```

I think we'd have to do `from __future__ import annotations` for `|` to be a valid 3.9 typing operator. But in general we've just been using `Optional` when something can be `None`.
Personally, I prefer the way you've done it here. `Optional` is IMO a bad name, because it conflicts with the notion of "optional parameter", i.e. a parameter which can have a default. That's very different from a parameter that can be `None`, which is what `Optional` defines. In any case, we should probably just keep using `Optional` here, and move towards `Union` or `|` everywhere in the code base at some point.
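To illustrate the 3.9-compatible spelling being discussed, here is a minimal sketch. The `Tensor = list` alias is a stand-in so the example is self-contained; the real field uses `torch.Tensor`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Stand-in for torch.Tensor so this sketch has no torch dependency.
Tensor = list


@dataclass
class TestContainerFile:
    # On Python 3.9, "tuple[...] | None" fails when the class body is
    # evaluated, so the annotation is spelled with Optional/Tuple instead:
    frame_index_data: Optional[Tuple[Tensor, Tensor, Tensor]] = None
```

The `Optional[...]` form evaluates fine on 3.9 because `typing.Optional` supports subscripting there, whereas `|` on a `types.GenericAlias` does not.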
```diff
@@ -319,6 +319,45 @@ void SingleStreamDecoder::scanFileAndUpdateMetadataAndIndex() {
   scannedAllStreams_ = true;
 }

+void SingleStreamDecoder::readFrameIndexUpdateMetadataAndIndex(
+    int streamIndex,
+    std::tuple<at::Tensor, at::Tensor, at::Tensor> frameIndex) {
```
Nit: we're currently using both `at::Tensor` and `torch::Tensor` throughout the codebase; we should probably align on one single use. `torch::Tensor` seems to be used more often:

```
~/dev/torchcodec (seek_mode_frame_index*) » git grep torch::Tensor | wc -l
82
~/dev/torchcodec (seek_mode_frame_index*) » git grep at::Tensor | wc -l
38
```

Maybe just open an issue to follow up on this?
I opened #767 to follow up on this change separately.
```cpp
  if (readFrameIndex_) {
    return;
  }
```
Calling `readFrameIndexUpdateMetadataAndIndex` twice with different `frameIndex` values might lead to unexpected results, because the frame index won't be updated on the second call.

As you mentioned in your comment, our code never actually calls `readFrameIndexUpdateMetadataAndIndex` twice, so this sequence of calls shouldn't occur in practice. However, for someone new to the codebase, this early return might be surprising. To address this, I suggest either:

- Removing the `readFrameIndex_` logic and allowing `readFrameIndexUpdateMetadataAndIndex` to be called multiple times, updating the index each time.
- Raising a clear error if it's called more than once, instead of exiting early.
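The second option could look roughly like the following Python sketch (illustrative only; the class and method names mirror the C++ side but are not real torchcodec APIs):

```python
class FrameIndexReader:
    """Sketch of the suggestion: raise on a repeated call instead of
    silently returning, so the surprising early return goes away."""

    def __init__(self):
        self._read_frame_index = False
        self.all_frames = []

    def read_frame_index_update_metadata_and_index(self, frame_index):
        if self._read_frame_index:
            raise RuntimeError(
                "readFrameIndexUpdateMetadataAndIndex was already called; "
                "calling it again with a different index is likely a bug."
            )
        # Stand-in for populating the per-stream FrameInfo index.
        self.all_frames = list(frame_index)
        self._read_frame_index = True
```

Either option removes the silent-no-op behavior; the error variant makes misuse loud instead of invisible.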
```cpp
  auto last_idx = all_frames.size(0) - 1;
  streamMetadata.beginStreamPtsFromContent = all_frames[0].item<int64_t>();
  streamMetadata.endStreamPtsFromContent =
      all_frames[last_idx].item<int64_t>() + duration[last_idx].item<int64_t>();
```
I would assume `-1` would work like it does in Python, but I'm not 100% sure:

```diff
-      all_frames[last_idx].item<int64_t>() + duration[last_idx].item<int64_t>();
+      all_frames[-1].item<int64_t>() + duration[-1].item<int64_t>();
```
```cpp
  auto& streamMetadata = containerMetadata_.allStreamMetadata[streamIndex];

  // Get the last index for key_frames and duration
```
Nit: the comment mentions `key_frames` but we're actually using `all_frames`.

Also, we should `TORCH_CHECK` here that all 3 tensors are of the same length!
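In Python terms, the suggested length validation amounts to something like this sketch (the function name and error message are illustrative, not torchcodec code):

```python
def validate_frame_index(all_frames, is_key_frame, duration):
    """Check that the three parallel sequences describing the frames
    (pts, key-frame flags, durations) have equal length, mirroring the
    suggested TORCH_CHECK on the C++ side."""
    if not (len(all_frames) == len(is_key_frame) == len(duration)):
        raise ValueError(
            "frame index tensors must have equal lengths, got "
            f"{len(all_frames)}, {len(is_key_frame)}, {len(duration)}"
        )
```

With this check in place, downstream loops can index all three sequences with a single loop variable without per-element bounds checks.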
```cpp
    // FrameInfo struct utilizes PTS
    FrameInfo frameInfo = {all_frames[i].item<int64_t>()};
    frameInfo.isKeyFrame =
        (i < key_frames.size(0) && key_frames[i].item<int64_t>() == 1);
```
The `i < key_frames.size(0)` check shouldn't be needed if we do the `TORCH_CHECK` validation mentioned above.

Also, I think `key_frames` should be a `bool` tensor, and I'd suggest renaming it to `is_key_frame`.
```cpp
  // endStreamPtsSecondsFromContent
  void readFrameIndexUpdateMetadataAndIndex(
      int streamIndex,
      std::tuple<at::Tensor, at::Tensor, at::Tensor> frameIndex);
```
Here or somewhere else, we should document the expected length, dtype, and associated semantics of each tensor.

We could also consider making `FrameIndex` a struct, instead of relying on a `tuple`.
```diff
@@ -29,7 +29,7 @@ class SingleStreamDecoder {
   // CONSTRUCTION API
   // --------------------------------------------------------------------------

-  enum class SeekMode { exact, approximate };
+  enum class SeekMode { exact, approximate, custom_frame_mappings };
```
I don't know if we'll want to publicly expose a new `seek_mode` in Python, but I think for the C++ side this is a reasonable approach. @scotts curious if you have any opinion on this?
Agreed. Right now, the `SingleStreamDecoder`'s `seekMode_` is what we use to determine if we should do a file scan or not:

torchcodec/src/torchcodec/_core/SingleStreamDecoder.cpp, lines 183 to 185 in d444567:

```cpp
  if (seekMode_ == SeekMode::exact) {
    scanFileAndUpdateMetadataAndIndex();
  }
```
Since the point of this feature is to avoid a file scan while still maintaining seek accuracy, I think it makes sense for it to be a new seek mode on the C++ side. We can figure out what to do with the public API later.
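The decision logic being described can be sketched in a few lines of Python (the enum mirrors the C++ `SeekMode`; the helper function name is illustrative, not an actual API):

```python
from enum import Enum, auto


class SeekMode(Enum):
    EXACT = auto()
    APPROXIMATE = auto()
    CUSTOM_FRAME_MAPPINGS = auto()


def needs_file_scan(mode: SeekMode) -> bool:
    # Only exact mode pays for a full file scan. Approximate mode relies on
    # header metadata, and custom_frame_mappings builds the same index from
    # user-provided data instead of scanning.
    return mode is SeekMode.EXACT
```

This captures why a third enum value is the natural C++ representation: the new mode wants exact-style seeking without the exact-mode scan.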
test/test_ops.py
```diff
@@ -448,6 +448,61 @@ def test_frame_pts_equality(self):
         )
         assert pts_is_equal

+    def test_seek_mode_custom_frame_mappings_fails(self):
+        decoder = create_from_file(str(NASA_VIDEO.path), "custom_frame_mappings")
```
Nit: here and in other places, use keyword arguments for extra clarity. We typically don't use kw for the path though (1st param), as it's very obvious what it is.

```diff
-        decoder = create_from_file(str(NASA_VIDEO.path), "custom_frame_mappings")
+        decoder = create_from_file(str(NASA_VIDEO.path), seek_mode="custom_frame_mappings")
```
test/utils.py
```python
        stream_index = self.default_stream_index
        if self.custom_frame_mappings_data is None:
            self.get_custom_frame_mappings(stream_index)
        return self.custom_frame_mappings_data
```
I could be wrong but I think there's a potential bug here: if `custom_frame_mappings()` is called twice, e.g.

```python
custom_frame_mappings(stream_index=0)
custom_frame_mappings(stream_index=1)
```

then the second call will still return the mapping of `stream_index=0`, because that's the one that was cached?

Perhaps the easiest way to fix this would be to make `custom_frame_mappings_data` a dict, where the keys are the stream indices and the values are the individual mappings. This is also how the other fields (`frames`, `stream_infos`, etc.) are implemented.

In any case, good job on the implementation overall, I like that we compute the mappings on the fly and cache them. This avoids us having to check those long files into the repo.
Good catch, this would cause a bug if repeatedly called. I'll update `custom_frame_mappings_data` to be a dict, and remove the `@property` decorator so that providing a `stream_index` is possible.
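The per-stream-index cache being agreed on here could look like this sketch (names mirror the test utilities but the body is illustrative; `_compute_mappings` stands in for running and parsing ffprobe):

```python
class ContainerFileMappings:
    """Sketch of the suggested fix: cache one mapping per stream index,
    keyed by the index, as the other TestContainerFile fields are."""

    def __init__(self):
        self._custom_frame_mappings_data = {}

    def _compute_mappings(self, stream_index):
        # Stand-in for running ffprobe on the given stream and parsing
        # its output into (pts_list, is_key_frame_list, duration_list).
        return ([stream_index * 100], [1], [0.033])

    def get_custom_frame_mappings(self, stream_index=0):
        if stream_index not in self._custom_frame_mappings_data:
            self._custom_frame_mappings_data[stream_index] = (
                self._compute_mappings(stream_index)
            )
        return self._custom_frame_mappings_data[stream_index]
```

Each stream index now gets its own cached entry, so interleaved calls with different indices can no longer return a stale mapping.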
test/utils.py
```python
            custom_frame_mappings_data[1].append(frame["key_frame"])
            custom_frame_mappings_data[2].append(float(frame["duration"]))

        (pts_list, key_frame_list, duration_list) = custom_frame_mappings_data
```
Instead of defining `custom_frame_mappings_data`, let's just define the 3 individual lists, i.e.

```python
pts_list = []
key_frame_list = []
duration_list = []
for frame in frames: ...
```

or even just create all 3 by list comprehension. It's 3X slower but I don't think we care.
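The list-comprehension variant is a one-liner per list. This sketch assumes each ffprobe frame entry is a dict with `pts`, `key_frame`, and `duration` fields, as in the surrounding discussion:

```python
# Hypothetical parsed ffprobe output for three frames.
frames = [
    {"pts": 0, "key_frame": 1, "duration": "0.033"},
    {"pts": 33, "key_frame": 0, "duration": "0.033"},
    {"pts": 66, "key_frame": 0, "duration": "0.034"},
]

# One pass per list: three traversals instead of one, but clearer.
pts_list = [frame["pts"] for frame in frames]
key_frame_list = [frame["key_frame"] for frame in frames]
duration_list = [float(frame["duration"]) for frame in frames]
```

For test fixtures with a few hundred frames, the extra traversals are negligible, which is the "3X slower but I don't think we care" point above.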
Thanks for addressing my comments @Dan-Flores , I made another pass! I think there's another design discussion we need to have, about whether we expect the mappings to be sorted.
```diff
@@ -121,6 +122,9 @@ class TestContainerFile:
     default_stream_index: int
     stream_infos: Dict[int, Union[TestVideoStreamInfo, TestAudioStreamInfo]]
     frames: Dict[int, Dict[int, TestFrameInfo]]
+    custom_frame_mappings_data: Dict[
```
Nit, since we should only access the mappings through the `get_custom_frame_mappings()` function, I'd suggest to "mark" that field as private:

```diff
-    custom_frame_mappings_data: Dict[
+    _custom_frame_mappings_data: Dict[
```
```cpp
  void readCustomFrameMappingsUpdateMetadataAndIndex(
      int streamIndex,
      std::tuple<at::Tensor, at::Tensor, at::Tensor> customFrameMappings);
```
Just bumping #764 (comment) again, which may have been missed:

- we should document the expected length, dtype, and associated semantics of each tensor.
- I'd also recommend relying on a new `FrameMappings` struct instead of a 3-tuple.
To elaborate on defining a new `FrameMappings` struct: I think of the code in `custom_ops.cpp` as the bridge layer between the C++ logic and the Python logic. So that's the layer where we would turn a tuple of tensors into a struct. That way, in the C++ logic, we're (as much as possible) operating on proper types with meaningful field names.

As a point of comparison, we do something similar with batched output. `SingleStreamDecoder` returns a `FrameBatchOutput` struct, and the code in `custom_ops.cpp` turns that into a tuple of tensors.
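A Python sketch of the bridge-layer idea: the tuple crosses the boundary, and everything past the boundary works with named fields. The `FrameMappings` name comes from the suggestion above; the field names and `Tensor = list` stand-in are illustrative:

```python
from dataclasses import dataclass

# Stand-in for torch.Tensor so the sketch is self-contained.
Tensor = list


@dataclass
class FrameMappings:
    """Three parallel per-frame sequences for one stream."""
    all_frames: Tensor    # pts of each frame
    is_key_frame: Tensor  # key-frame flag per frame
    duration: Tensor      # duration of each frame, in time_base units

    @classmethod
    def from_tuple(cls, t):
        # The bridge layer (custom_ops.cpp in the analogy) converts the
        # 3-tuple coming from Python into a named struct, validating here
        # so the core logic can assume consistency.
        all_frames, is_key_frame, duration = t
        if not (len(all_frames) == len(is_key_frame) == len(duration)):
            raise ValueError("frame mapping tensors must have equal lengths")
        return cls(all_frames, is_key_frame, duration)
```

Past this point, code reads `mappings.is_key_frame` instead of `std::get<1>(...)`, which is the readability win being argued for.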
```cpp
    streamMetadata.numFramesFromContent = all_frames.size(0);
    for (int64_t i = 0; i < all_frames.size(0); ++i) {
      // FrameInfo struct utilizes PTS
      FrameInfo frameInfo = {all_frames[i].item<int64_t>()};
```
I'm not very familiar with that kind of initialization, but it's not immediately obvious that it's setting the `frameInfo.pts` field. Let's set the field explicitly instead, for clarity.
```cpp
      frameInfo.nextPts = (i + 1 < all_frames.size(0))
          ? all_frames[i + 1].item<int64_t>()
          : INT64_MAX;
      streamInfos_[streamIndex].allFrames.push_back(frameInfo);
```
We should also make sure to update the `streamInfos_[streamIndex].keyFrames` index, if `isKeyFrame` is true!
```cpp
      frameInfo.isKeyFrame = (is_key_frame[i].item<bool>() == true);
      frameInfo.nextPts = (i + 1 < all_frames.size(0))
          ? all_frames[i + 1].item<int64_t>()
          : INT64_MAX;
```
Let's also set the `frameInfo`'s `frameIndex` field. I think it should be `i`?

EDIT: it should actually be set after we are sure the sequence is sorted, see other comment below.
```cpp
      streamInfos_[streamIndex].allFrames.push_back(frameInfo);
    }
  }
```
I realize reading this that there is an additional design decision we'll have to make: whether we expect the index to be already sorted, or not.

In the existing `scanFileAndUpdateMetadataAndIndex()` function we are reading packets in order, and yet we are sorting them afterwards:

torchcodec/src/torchcodec/_core/SingleStreamDecoder.cpp, lines 285 to 298 in 86e952f:

```cpp
  // Sort all frames by their pts.
  for (auto& [streamIndex, streamInfo] : streamInfos_) {
    std::sort(
        streamInfo.keyFrames.begin(),
        streamInfo.keyFrames.end(),
        [](const FrameInfo& frameInfo1, const FrameInfo& frameInfo2) {
          return frameInfo1.pts < frameInfo2.pts;
        });
    std::sort(
        streamInfo.allFrames.begin(),
        streamInfo.allFrames.end(),
        [](const FrameInfo& frameInfo1, const FrameInfo& frameInfo2) {
          return frameInfo1.pts < frameInfo2.pts;
        });
```

I suspect that frame mappings coming from `ffprobe` won't be ordered in general. I think we have the following options:

- expect the input mapping to be sorted - that may not be a great UX
- sort the mapping in Python - this is kinda what this PR is doing, by sorting the mappings in the tests - but we should remove that and have that logic within the code, not the tests
- sort the mappings in C++

I think the simpler, safer choice is to sort in C++ and rely on the same sorting logic that we have in `scanFileAndUpdateMetadataAndIndex()`. Curious what your thoughts are @Dan-Flores @scotts?
Agreed on sorting on the C++ side; we can do it quickly on the C++ side, and it's much nicer to users. We should extract the existing logic out to a utility function called only in this cpp file, and call it in both places.
I think we sort in the scan function because I think it's possible for the actual packets to be not in PTS order.
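Sorting the three parallel mappings jointly by pts can be sketched as follows (a Python illustration of what the shared C++ utility would do; the function name is hypothetical):

```python
def sort_frame_mappings(pts, is_key_frame, duration):
    """Jointly sort three parallel per-frame lists by pts, mirroring the
    pts sort in scanFileAndUpdateMetadataAndIndex()."""
    order = sorted(range(len(pts)), key=lambda i: pts[i])
    return (
        [pts[i] for i in order],
        [is_key_frame[i] for i in order],
        [duration[i] for i in order],
    )
```

Sorting indices rather than the lists themselves keeps the three sequences aligned, which is exactly what the C++ side gets for free by sorting `FrameInfo` structs that carry all the fields together.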
```diff
-        metadata_getter.keywords["seek_mode"] == "exact"
-        if isinstance(metadata_getter, functools.partial)
-        else False
+@pytest.mark.parametrize("seek_mode", ["approximate", "exact", "custom_frame_mappings"])
```
I think you may have removed `get_container_metadata_from_header` from the previous parametrization.
```cpp
  streamMetadata.beginStreamPtsFromContent = all_frames[0].item<int64_t>();
  streamMetadata.endStreamPtsFromContent =
      all_frames[-1].item<int64_t>() + duration[-1].item<int64_t>();
```
TIL that tensors on the C++ side also support negative indices! I'm so used to that being not allowed in C++ arrays and standard containers that I initially thought this was undefined behavior!
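The arithmetic in question is simple enough to show with plain Python lists (which share the negative-indexing semantics the tensors are relying on); the function name is illustrative:

```python
def end_stream_pts(all_frames, duration):
    """End-of-stream pts = pts of the last frame plus its duration.
    Negative indices pick the last element, as on the C++ tensor side."""
    return all_frames[-1] + duration[-1]
```

So for frames with pts `[0, 33, 66]` and durations `[33, 33, 34]`, the stream ends at pts 100 in time_base units.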
```cpp
  streamMetadata.endStreamPtsSecondsFromContent =
      *streamMetadata.endStreamPtsFromContent * av_q2d(avStream->time_base);
```
We should probably use the `ptsToSeconds()` function here, but it looks like we're also not using it in the scanning function. And we can probably better define `ptsToSeconds()` to use `av_q2d()`. Let's address that elsewhere; created #770 to track.
This PR works towards enabling decoders to receive an external frame index, as described in #466. It is only enabled in the private C++ code, and not exposed in Python yet.

The changes:

- The frame index is passed to `add_video_stream` in the format of 3 tensors in a tuple.
- A `stream_index` needs to be specified to update the decoder's metadata from the ffprobe output, and that `stream_index` is explicitly available during the `add_video_stream` call.
- Adds `readFrameIndexUpdateMetadataAndIndex`, which parallels the function used by exact mode, `scanFileAndUpdateMetadataAndIndex`. Each frame's entry is stored as a `FrameInfo`.

Testing:

- A `frame_index` property is added to `TestContainerFile`, which runs, parses, and sorts the ffprobe output.
- Updates `test_get_metadata` to allow the new seek mode to work.
- Adds `test_seek_mode_frame_index`, which tests a variety of frame grabbing functions in the new seek mode.