
Conversation

daikw commented Nov 4, 2025

Summary

Fixes an IndexError that occurs when using --resume with --dataset.video_encoding_batch_size > 1 during dataset recording.

Problem

When resuming recording with batch encoding enabled, the in-memory self.meta.episodes dataset was not being updated with newly recorded episodes. This caused an IndexError when trying to access episode metadata during batch encoding:

IndexError: Invalid key: 2 is out of bounds for size 2

The issue occurred because:

  1. New episodes were saved to parquet files
  2. self.meta.total_episodes was updated
  3. But self.meta.episodes (HF Dataset) remained stale
  4. Batch encoding tried to access episodes beyond the original size (reproduced in the snippet below)
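
The failure mode can be reproduced in isolation with the datasets library (the column name below is illustrative, not lerobot's actual schema):

from datasets import Dataset

# Stale in-memory metadata: only the episodes that existed before --resume.
episodes = Dataset.from_dict({"episode_index": [0, 1]})
total_episodes = 3  # counter already incremented for the newly recorded episode

try:
    episodes[total_episodes - 1]  # index 2 on a 2-row Dataset
except IndexError as err:
    print(err)  # Invalid key: 2 is out of bounds for size 2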

Solution

Reload the episodes metadata at the start of _batch_save_episode_video() to ensure all episode data is available before accessing episode indices.

This follows the same pattern already used at line 1199, where episodes are reloaded when switching to a new chunk/file.
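
A minimal sketch of the reload, using the names mentioned in this PR (self.meta.episodes, a load_episodes helper); the signature of _batch_save_episode_video, the load_episodes argument, and the range parameters are assumptions:

def _batch_save_episode_video(self, start_episode: int, end_episode: int) -> None:
    # Reload episode metadata from disk so that episodes recorded after
    # --resume are visible before any index-based access below.
    self.meta.episodes = load_episodes(self.root)
    for episode_index in range(start_episode, end_episode):
        episode = self.meta.episodes[episode_index]  # no longer out of bounds
        ...  # encode this episode's videos from the buffered images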

Test Plan

The fix should be tested by:

  1. Recording a dataset with batch encoding disabled or with a small batch size
  2. Resuming recording with --resume=true and larger --dataset.video_encoding_batch_size
  3. Verifying that batch encoding completes without IndexError

Example command that previously failed:

uv run lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM1 \
  --teleop.type=so101_leader \
  --teleop.port=/dev/ttyACM0 \
  --dataset.repo_id=test/dataset \
  --dataset.video_encoding_batch_size=10 \
  --resume=true

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings November 4, 2025 08:14

Copilot AI left a comment

Pull Request Overview

This PR fixes an IndexError that occurs when resuming dataset recording with batch video encoding enabled. The issue was caused by stale episode metadata not being synchronized with newly recorded episodes when using the --resume flag.

  • Adds a metadata reload operation before batch video encoding
  • Ensures self.meta.episodes contains all available episodes before accessing them by index
  • Follows existing pattern used elsewhere in the codebase for metadata synchronization


…h encoding

The previous fix to reload episodes before batch encoding was incomplete.
When batch encoding is triggered, the metadata buffer may not have been
flushed to disk yet, causing load_episodes() to fail with a NoneType error.

This fix ensures that:
1. Metadata buffer is flushed to disk before attempting to reload
2. Episodes are reloaded to get the latest metadata
3. Batch encoding can proceed with complete episode information

Fixes: TypeError: 'NoneType' object is not subscriptable

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
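
A sketch of what this intermediate commit adds on top of the reload above; the flush helper's name comes from the next commit and its signature is an assumption (and, as that commit explains, flushing alone later proved insufficient):

def _batch_save_episode_video(self, start_episode: int, end_episode: int) -> None:
    # Write buffered episode metadata to disk first; otherwise the reload
    # sees incomplete metadata and fails with the NoneType error noted above.
    self._flush_metadata_buffer()
    self.meta.episodes = load_episodes(self.root)
    ...
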
daikw marked this pull request as draft November 4, 2025 08:33
daikw and others added 4 commits November 4, 2025 17:41
…encoding

The ParquetWriter buffers data and only writes complete files when closed.
Simply flushing the buffer is not sufficient - the file remains incomplete
and cannot be read by PyArrow, resulting in:
  "Parquet magic bytes not found in footer"

This fix:
1. Calls _close_writer() instead of _flush_metadata_buffer()
2. Ensures the ParquetWriter is properly closed and data is fully written
3. A new writer will be created on the next metadata write operation

Fixes: pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
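
The ParquetWriter behavior described above can be reproduced directly with pyarrow (file and column names are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"episode_index": [0, 1, 2]})
writer = pq.ParquetWriter("episodes.parquet", table.schema)
writer.write_table(table)

# The footer (and its closing magic bytes) is only written on close, so the
# file is not yet readable as parquet at this point.
try:
    pq.read_table("episodes.parquet")
except pa.ArrowInvalid as err:
    print(err)  # e.g. "Parquet magic bytes not found in footer..."

writer.close()
print(pq.read_table("episodes.parquet").num_rows)  # 3
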
When using batch encoding, temporary images must be preserved until
batch encoding completes. The previous code deleted images immediately
after each episode, causing FileNotFoundError when batch encoding tried
to access them.

This fix:
1. Skips image deletion in save_episode() when using batch encoding
2. Deletes images after each episode's video is encoded in batch mode
3. Ensures images are available for batch encoding while cleaning up afterward

Fixes: FileNotFoundError: No images found in .../episode-000000

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
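
A rough sketch of the control flow this commit describes; the attribute and helper names (video_encoding_batch_size, _encode_episode_video, _delete_episode_images) are placeholders rather than lerobot's exact internals, and a later commit in this PR moves the cleanup back further still:

def save_episode(self) -> None:
    ...  # write episode data and metadata as before
    if self.video_encoding_batch_size <= 1:
        # Immediate encoding: the temporary images are no longer needed.
        self._encode_episode_video(self.episode_index)
        self._delete_episode_images(self.episode_index)
    # else: keep the images on disk; _batch_save_episode_video() deletes them
    # after each episode's video has been encoded.
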
The resume logic in _save_episode_video() was checking only for the
existence of episodes, not whether those episodes had video metadata.

In batch encoding scenarios:
1. Episodes 0-9 are recorded with metadata (no video metadata yet)
2. Batch encoding starts and reloads episodes
3. _save_episode_video(video_key, 0) is called
4. episode_index == 0, so it enters the first-episode branch
5. self.meta.episodes exists and has length > 0
6. Code tries to access videos/{video_key}/chunk_index
7. KeyError: this key doesn't exist yet (videos not encoded)

This fix adds a check to verify that video metadata actually exists
before treating it as a resume case. This prevents KeyError when
batch encoding a new dataset or episodes without prior video metadata.

Fixes: KeyError: 'videos/observation.images.overhead/chunk_index'

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
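
A sketch of the added guard; the flat metadata key follows the one named in the traceback above, while the method signature and the column_names access are assumptions:

def _save_episode_video(self, video_key: str, episode_index: int) -> None:
    chunk_key = f"videos/{video_key}/chunk_index"
    is_resume = (
        episode_index == 0
        and self.meta.episodes is not None
        and len(self.meta.episodes) > 0
        and chunk_key in self.meta.episodes.column_names
    )
    if is_resume:
        # Previous episodes already carry encoded-video metadata: continue
        # from the stored chunk/file indices.
        ...
    else:
        # New dataset, or episodes recorded without videos encoded yet:
        # start the chunk/file indices from zero.
        ...
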
Two critical fixes for batch encoding reliability:

1. **Defer image cleanup to after successful batch encoding**
   Problem: Images were deleted inside the _batch_save_episode_video() loop,
   but if an exception occurred after encoding (e.g., during parquet save),
   the retry in VideoEncodingManager.__exit__ would fail with FileNotFoundError.

   Solution: Move image cleanup to save_episode() and VideoEncodingManager.__exit__,
   ensuring cleanup happens only after the entire batch encoding succeeds.
   This allows retries to access the images if needed.

2. **Add null check for video metadata values**
   Problem: Checking only for key existence wasn't sufficient - the key
   can exist in the parquet schema but have NULL values, causing:
   "TypeError: unsupported operand type(s) for +=: 'NoneType' and 'int'"

   Solution: Add an explicit check that video metadata values are not None
   before treating it as a resume case.

Fixes:
- FileNotFoundError: No images found during batch encoding retry
- TypeError in update_chunk_file_indices with NoneType

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
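
The value check from point 2 extends the guard sketched after the previous commit: a key can be present in the parquet schema while the stored values are NULL, so both conditions are needed. The record access below assumes HF Dataset semantics, where NULL values come back as None; the rest of the method is elided:

def _save_episode_video(self, video_key: str, episode_index: int) -> None:
    chunk_key = f"videos/{video_key}/chunk_index"
    has_video_meta = (
        self.meta.episodes is not None
        and len(self.meta.episodes) > 0
        and chunk_key in self.meta.episodes.column_names
        # Key present but value NULL (None) means the video was never
        # encoded, so this is not a resume case either.
        and self.meta.episodes[0][chunk_key] is not None
    )
    ...

Per point 1, the image cleanup itself is only relocated: the per-episode deletion moves out of the _batch_save_episode_video() loop into save_episode() and VideoEncodingManager.__exit__, so a failed batch can be retried against images that are still on disk.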