Implement Real-Time Action Chunking (RTC) for SmolVLA #1521

Draft
ben-z wants to merge 64 commits into main

Conversation

ben-z
Contributor

@ben-z ben-z commented Jul 17, 2025

What this does

This PR implements Real-Time Action Chunking (RTC) for SmolVLA. This greatly improves the smoothness of async inference compared to the existing methods.

This PR builds on top of #1514 (which builds on #1486, which in turn builds on #1480), all of which are experimental themselves. I believe we can polish these PRs one by one. Please review those PRs first!

New SmolVLA parameters:

  • inference_enable_rtc: Whether to enable RTC
  • inference_rtc_d: Inference delay in ticks. If unset (-1), it is determined automatically from the round-trip inference latency. This sets the number of steps covered by the hard mask.
  • inference_rtc_soft_mask_length: Length of the soft mask (which blends the new chunk with the old chunk) in ticks. If unset (-1), it is determined automatically from the completion progress of the previous chunk. This parameter is not configurable in the paper (it is always determined automatically there), but I observed that shrinking the soft mask to a small proportion of the chunk size (e.g. 15 for a chunk size of 50) improves responsiveness without compromising smoothness. See the mask sketch after this list.
  • inference_rtc_beta: Maximum guidance weight. See the paper for details. I set it to 5, which the paper's authors found to work well.
  • inference_rtc_debug: Enables debug printing, at the cost of a slower denoising process.
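
To make the masking concrete, here is a minimal sketch of how the hard and soft masks could be laid out over a chunk, using the values mentioned above (soft mask of 15 for a chunk size of 50). This only illustrates the mask shape and a naive blend; in RTC proper (and in this PR) the mask drives the guided flow-matching denoising, with the guidance weight capped by inference_rtc_beta, rather than being applied as a post-hoc average. The helper name is made up for the example.

import torch

def build_rtc_blend_weights(chunk_size: int, d: int, soft_len: int) -> torch.Tensor:
    # Steps [0, d): hard mask (weight 1.0). These actions are already being
    # executed while the new chunk is generated, so they must follow the old chunk.
    # Steps [d, d + soft_len): soft mask, decaying linearly from 1 to 0.
    # Remaining steps: weight 0.0, i.e. take the new chunk as-is.
    w = torch.zeros(chunk_size)
    w[:d] = 1.0
    if soft_len > 0 and d < chunk_size:
        n = min(soft_len, chunk_size - d)
        w[d : d + n] = torch.linspace(1.0, 0.0, n + 2)[1:-1]
    return w

# Toy usage: blend the leftover part of the previous chunk into the new one.
chunk_size, d, soft_len = 50, 15, 15
prev_chunk = torch.randn(chunk_size, 6)  # actions left over from the previous chunk
new_chunk = torch.randn(chunk_size, 6)   # freshly generated chunk
w = build_rtc_blend_weights(chunk_size, d, soft_len).unsqueeze(-1)
blended = w * prev_chunk + (1 - w) * new_chunk  # first d steps equal prev_chunk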

TODOs:

  • Finalize the async inference interface to support policy parameters (set once at startup) and runtime parameters (which can change for each inference, e.g. min/mean/max round-trip inference latency). The client could send stats like round-trip latency to the server, or the server could infer them from timestamps. See the sketch after this list.
  • Currently, this PR probably breaks the async inference pipeline of other models (e.g. ACT, pi0) due to the need to pass additional runtime parameters for RTC. This can be fixed with a small refactor.
  • There are many layers of function calls between the client and the function that executes the policy. This was fine when we didn't have any runtime parameters; now that we do, we may want to refactor the policy server or the gRPC interface to better support this use case.
  • For some reason, the dt parameter in the SmolVLA model is negative, and the denoising function is written slightly differently from the RTC paper. We will need to revisit the SmolVLA paper to decide which convention to adopt.
  • Add demo videos
  • Add tests

How it was tested

  1. Tested the core denoising function manually to make sure the implementation behaves correctly (e.g. hard/soft masking behave as expected). This was done during development, simply by running the SmolVLA model. A rough automated version of this check is sketched after this list.
  2. Tested the async functionality with and without RTC. There's a pretty clear difference between the two.
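
For the "Add tests" TODO, an automated version of check 1 could look roughly like the following. It only asserts the masking invariant on a naive blend; a real test would call the PR's denoising function instead.

import torch

def test_hard_mask_preserves_executing_actions():
    # With inference delay d, the first d actions of the blended chunk must match
    # the previous chunk exactly, since they are executed while the new chunk is
    # still being generated; the remaining steps should come from the new chunk.
    chunk_size, d = 50, 15
    prev_chunk = torch.randn(chunk_size, 6)
    new_chunk = torch.randn(chunk_size, 6)
    w = torch.zeros(chunk_size, 1)
    w[:d] = 1.0
    blended = w * prev_chunk + (1 - w) * new_chunk
    assert torch.allclose(blended[:d], prev_chunk[:d])
    assert torch.allclose(blended[d:], new_chunk[d:])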

https://x.com/un1c0rnioz/status/1946128982262579460

RTC.vs.forward.mixing.mov

The implementation adds one configurable parameter on top of the paper, inference_rtc_soft_mask_length, which makes the soft masking horizon configurable and alleviates this issue:

RTC.end_s.75.vs.end_s.s.stuck.mov

How to checkout & try? (for the reviewer)

This PR is under development, so please read the code first! The instructions for running async inference mostly apply, but this PR contains some interface changes to allow policy configuration passthrough. Here's the command I'm testing with, for reference:

HF_USER=$(huggingface-cli whoami | head -n 1)
echo "Hugging Face user: $HF_USER"
python lerobot/scripts/server/robot_client.py  \
  --server_address=127.0.0.1:18080 \
  --robot.type=so101_follower \
  --robot.port=$F1_PORT \
  --robot.cameras="${CAMERA_CONFIG}" \
  --robot.id=f1 \
  --policy.path=${HF_USER}/smolvla_so101_die_mat4_b64_lr5e-4_cs100_nas100_robo_200000 \
  --task="Grasp the die and put it on the mat." \
  --policy.device=cuda \
  --policy.compile_model=true \
  --actions_per_chunk=100 \
  --chunk_size_threshold=1.0 \
  --aggregate_fn_name=latest_only \
  --debug_visualize_queue_size=true \
  --policy.inference_enable_rtc=true \
  --policy.inference_rtc_d=15 \
  --policy.inference_rtc_soft_mask_length=15

@helper2424
Contributor

Niice - RTC is a cool thing.

@pkooij pkooij added the policies Items related to robot policies label Jul 17, 2025
helper2424 and others added 19 commits July 18, 2025 15:12
Co-authored-by: Ben Zhang <ben.zhang@uwaterloo.ca>
It's easy to get stuck in an infinite update loop, since the check for whether we have hit the chunk threshold is at the end of the control loop and the queue merge logic is at the beginning of the control loop.

One way to fix this is to move the queue update logic and the observation gathering step to the policy client process.
… chunk size

For some reason, 1MiB chunks send faster than 2MiB chunks when sending ~2.5MiB of data.
ben-z and others added 2 commits July 20, 2025 04:11
Traceback (most recent call last):
  File "/Users/ben/Projects/lerobot/src/lerobot/scripts/server/robot_client.py", line 80, in <module>
    from lerobot.transport.utils import grpc_channel_options, send_bytes_in_chunks
  File "/Users/ben/Projects/lerobot/src/lerobot/transport/utils.py", line 75, in <module>
    def receive_bytes_in_chunks(iterator, queue: Queue | None, shutdown_event: Event, log_prefix: str = ""):  # ruff: noqa
TypeError: unsupported operand type(s) for |: 'method' and 'NoneType'

https://www.reddit.com/r/learnpython/comments/1jaar9e/comment/mhltq2t/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
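
For context on the traceback above: multiprocessing.Queue is a bound factory method of the default context, not a class, so evaluating the annotation Queue | None at import time attempts method | None and fails (this is what the linked Reddit thread explains). Below is a minimal sketch of two common workarounds; the actual fix on this branch may differ, and threading.Event is used here only to keep the sketch self-contained.

from multiprocessing import Queue
print(type(Queue))  # <class 'method'>: a context factory, not a type

# Workaround 1: defer annotation evaluation (PEP 563) so the union is never
# executed at import time. This must be the first statement of the module:
#   from __future__ import annotations

# Workaround 2: annotate with the real class from multiprocessing.queues
# (the `X | None` syntax still requires Python 3.10+ at runtime).
from multiprocessing.queues import Queue as MPQueue
from threading import Event

def receive_bytes_in_chunks(iterator, queue: MPQueue | None, shutdown_event: Event, log_prefix: str = ""):
    ...
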
@ben-z
Contributor Author

ben-z commented Jul 20, 2025

Rebased this PR onto #1480. Once that’s merged, the diff here should shrink. I made some changes to the policy server to support RTC; while we could split those out, some are interleaved with RTC updates in the same commits, so keeping them here might be simpler. If we go with this approach, we can close #1486 and #1514.

@atharva-18

Hey, do we have any plans to implement RTC for diffusion as well?

@helper2424
Contributor

@atharva-18 Hey, I am doing another PR with RTC for pi0 & SmolVLA here: #1698.

@ben-z did a great job. I want to use this PR and introduce RTC before changing the network architecture, because it requires extending the Proto interfaces in advance. (Of course, I will add him as co-author everywhere.)

Regarding the original question: RTC originally works only with flow matching. I am not 100% familiar with diffusion at the low level, but I am not sure the implementation will be straightforward. It will probably require another technique to interpolate between chunks.
