fix(dataset): reload episodes metadata before batch video encoding #2379
base: main
Conversation
When using `--resume` with batch encoding enabled, the in-memory `self.meta.episodes` dataset was not being updated with newly recorded episodes. This caused an `IndexError` when trying to access episode metadata during batch encoding.

The issue occurred because:
1. New episodes were saved to parquet files
2. `self.meta.total_episodes` was updated
3. But `self.meta.episodes` (HF Dataset) remained stale
4. Batch encoding tried to access episodes beyond the original size

This fix reloads the episodes metadata at the start of `_batch_save_episode_video()` to ensure all episode data is available.

Fixes `IndexError: Invalid key: X is out of bounds for size Y`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
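For orientation, a minimal sketch of the shape of this change (not the actual diff): the `load_episodes` helper and its argument are assumptions taken from the commit messages further down, and the real method signature in lerobot may differ.

```python
def _batch_save_episode_video(self, start_episode: int, end_episode: int) -> None:
    # Reload the episodes metadata from disk so that episodes recorded after
    # --resume are visible in memory; without this, self.meta.episodes is the
    # stale HF Dataset loaded at startup, and indexing past its original
    # length raises IndexError.
    self.meta.episodes = load_episodes(self.root)  # assumed helper/argument

    for ep_idx in range(start_episode, end_episode):
        episode = self.meta.episodes[ep_idx]  # previously: IndexError on resume
        ...  # encode this episode's videos (unchanged)
```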
Pull Request Overview
This PR fixes an IndexError that occurs when resuming dataset recording with batch video encoding enabled. The issue was caused by stale episode metadata not being synchronized with newly recorded episodes when using the --resume flag.
- Adds a metadata reload operation before batch video encoding
- Ensures `self.meta.episodes` contains all available episodes before accessing them by index
- Follows the existing pattern used elsewhere in the codebase for metadata synchronization
…h encoding

The previous fix to reload episodes before batch encoding was incomplete. When batch encoding is triggered, the metadata buffer may not have been flushed to disk yet, causing `load_episodes()` to fail with a NoneType error.

This fix ensures that:
1. Metadata buffer is flushed to disk before attempting to reload
2. Episodes are reloaded to get the latest metadata
3. Batch encoding can proceed with complete episode information

Fixes: `TypeError: 'NoneType' object is not subscriptable`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
…encoding

The ParquetWriter buffers data and only writes complete files when closed. Simply flushing the buffer is not sufficient: the file remains incomplete and cannot be read by PyArrow, resulting in "Parquet magic bytes not found in footer".

This fix:
1. Calls `_close_writer()` instead of `_flush_metadata_buffer()`
2. Ensures the ParquetWriter is properly closed and data is fully written
3. A new writer will be created on the next metadata write operation

Fixes: `pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
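The footer issue is reproducible with PyArrow alone; the standalone snippet below (not lerobot code) illustrates why flushing is not enough and the file only becomes readable after `close()`:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"episode_index": [0, 1, 2]})
writer = pq.ParquetWriter("episodes.parquet", table.schema)
writer.write_table(table)

# At this point row-group data may already be on disk, but the footer (and the
# trailing "PAR1" magic bytes) is only written when the writer is closed.
try:
    pq.read_table("episodes.parquet")
except pa.lib.ArrowInvalid as exc:
    print(exc)  # e.g. "Parquet magic bytes not found in footer..."

writer.close()
print(pq.read_table("episodes.parquet").num_rows)  # 3 — readable after close()
```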
When using batch encoding, temporary images must be preserved until batch encoding completes. The previous code deleted images immediately after each episode, causing FileNotFoundError when batch encoding tried to access them.

This fix:
1. Skip image deletion in `save_episode()` when using batch encoding
2. Delete images after each episode's video is encoded in batch mode
3. Ensures images are available for batch encoding while cleaning up afterward

Fixes: `FileNotFoundError: No images found in .../episode-000000`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
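A hedged sketch of the conditional this commit describes; method and attribute names are illustrative, not the actual lerobot API, and note that a later commit below defers the in-loop cleanup further:

```python
# Illustrative sketch only.
def save_episode(self) -> None:
    ...
    if self.batch_encoding_size == 1:
        # Immediate per-episode encoding: the temporary frames are no longer
        # needed once this episode's video has been written.
        self._delete_episode_images(self.episode_buffer["episode_index"])
    # With batch encoding (> 1), keep the frames on disk; as of this commit
    # they are removed per episode inside _batch_save_episode_video() after
    # that episode's video is encoded.
```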
The resume logic in _save_episode_video() was checking only for the
existence of episodes, not whether those episodes had video metadata.
In batch encoding scenarios:
1. Episodes 0-9 are recorded with metadata (no video metadata yet)
2. Batch encoding starts and reloads episodes
3. _save_episode_video(video_key, 0) is called
4. episode_index == 0, so it enters the first-episode branch
5. self.meta.episodes exists and has length > 0
6. Code tries to access videos/{video_key}/chunk_index
7. KeyError: this key doesn't exist yet (videos not encoded)
This fix adds a check to verify that video metadata actually exists
before treating it as a resume case. This prevents KeyError when
batch encoding a new dataset or episodes without prior video metadata.
Fixes: KeyError: 'videos/observation.images.overhead/chunk_index'
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
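A sketch of the guard this commit describes; the surrounding structure of `_save_episode_video()` and the exact metadata layout are assumptions based on the steps listed above:

```python
# Sketch, not the actual lerobot implementation.
if episode_index == 0:
    chunk_key = f"videos/{video_key}/chunk_index"
    has_video_metadata = (
        self.meta.episodes is not None
        and len(self.meta.episodes) > 0
        and chunk_key in self.meta.episodes.column_names
    )
    if has_video_metadata:
        # Resume case: continue from the chunk/file indices stored in metadata.
        chunk_idx = self.meta.episodes[0][chunk_key]
    else:
        # New dataset (episodes recorded but no videos encoded yet): start the
        # chunk/file indices at 0 instead of raising KeyError.
        chunk_idx = 0
```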
Two critical fixes for batch encoding reliability:

1. **Defer image cleanup to after successful batch encoding**

   Problem: Images were deleted inside the `_batch_save_episode_video()` loop, but if an exception occurred after encoding (e.g., during parquet save), the retry in `VideoEncodingManager.__exit__` would fail with FileNotFoundError.

   Solution: Move image cleanup to `save_episode()` and `VideoEncodingManager.__exit__`, ensuring cleanup happens only after the entire batch encoding succeeds. This allows retries to access the images if needed.

2. **Add null check for video metadata values**

   Problem: Checking only for key existence wasn't sufficient: the key can exist in the parquet schema but have NULL values, causing `TypeError: unsupported operand type(s) for +=: 'NoneType' and 'int'`.

   Solution: Add explicit check that video metadata values are not None before treating as resume case.

Fixes:
- FileNotFoundError: No images found during batch encoding retry
- TypeError in `update_chunk_file_indices` with NoneType

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
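Extending the sketch above, the value check this commit adds might look like the following (again a sketch under the same assumed names):

```python
# The column may exist in the parquet schema but hold NULL for episodes whose
# videos were never encoded, so check the stored value as well as the key.
chunk_idx = None
if has_video_metadata:
    chunk_idx = self.meta.episodes[0][chunk_key]
is_resume = chunk_idx is not None  # only then advance the chunk/file indices
```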
Summary
Fixes an `IndexError` that occurs when using `--resume` with `--dataset.video_encoding_batch_size > 1` during dataset recording.

Problem
When resuming recording with batch encoding enabled, the in-memory `self.meta.episodes` dataset was not being updated with newly recorded episodes. This caused an `IndexError` when trying to access episode metadata during batch encoding. The issue occurred because:
1. New episodes were saved to parquet files
2. `self.meta.total_episodes` was updated
3. But `self.meta.episodes` (HF Dataset) remained stale
4. Batch encoding tried to access episodes beyond the original size

Solution
Reload the episodes metadata at the start of `_batch_save_episode_video()` to ensure all episode data is available before accessing episode indices.

This follows the same pattern already used in line 1199 where episodes are reloaded when switching to a new chunk/file.
Test Plan
The fix should be tested by:
- Resuming recording with `--resume=true` and a larger `--dataset.video_encoding_batch_size`

Example command that previously failed:
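The original command is not preserved in this extract; an illustrative invocation built only from the flags named in this PR might look like the following, where the entry point, robot type, repo id, and episode/batch counts are placeholders rather than the author's actual values:

```bash
lerobot-record \
  --robot.type=so101_follower \
  --dataset.repo_id=<user>/<dataset_name> \
  --dataset.num_episodes=20 \
  --dataset.video_encoding_batch_size=10 \
  --resume=true
```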
🤖 Generated with Claude Code