Fix VBot immediate termination on spawn #6

Draft
Copilot wants to merge 30 commits into main from copilot/fix-immediate-termination-bug
Contributor

Copilot AI commented Feb 12, 2026

VBot robots were terminating within 1-5 steps of spawning (mean reward ~0.29 vs expected >2.0) due to four interrelated bugs in reset and termination logic.

Changes

1. Reset position generation

Before: Polar coordinate generation on outer circle (radius 3.0m) ignored cfg.init_state.pos
After: Use curriculum learning position with XY randomization only

# Old: spawn on outer circle, far from target
for i in range(num_envs):
    theta = np.random.uniform(0, 2 * np.pi)
    radius = cfg.arena_outer_radius + np.random.uniform(-0.1, 0.1)
    robot_init_xy[i] = [radius * np.cos(theta), radius * np.sin(theta)]

# New: spawn at curriculum position with small noise
base_pos = np.array(cfg.init_state.pos, dtype=np.float32)  # [0.0, 0.6, 0.35]
robot_init_pos = np.tile(base_pos, (num_envs, 1))
# pr is read as (x_min, y_min, x_max, y_max); only X and Y are randomized
pr = cfg.init_state.pos_randomization_range
robot_init_pos[:, :2] += np.random.uniform([pr[0], pr[1]], [pr[2], pr[3]], (num_envs, 2))

Also: init_state.pos[2] 0.5→0.35m to reduce fall impact, boundary_radius 3.5→5.0m
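
To make the effect of this change concrete, here is a minimal standalone sketch that compares spawn-to-target distances under the old polar scheme and the new curriculum scheme. It assumes the target sits at the origin (as the original prompt describes target_point_a), and the randomization range `pr` is a hypothetical value chosen for illustration, not the value in the repository config.

```python
import numpy as np

rng = np.random.default_rng(0)
num_envs = 1000
target_xy = np.array([0.0, 0.0])   # assumed target at the origin
arena_outer_radius = 3.0
old_boundary_radius = 3.5

# Old scheme: spawn on the outer circle, far from the target
theta = rng.uniform(0.0, 2.0 * np.pi, num_envs)
radius = arena_outer_radius + rng.uniform(-0.1, 0.1, num_envs)
old_xy = np.stack([radius * np.cos(theta), radius * np.sin(theta)], axis=1)

# New scheme: curriculum position plus small XY noise
base_pos = np.array([0.0, 0.6, 0.35], dtype=np.float32)
pr = (-0.1, -0.1, 0.1, 0.1)        # hypothetical (x_min, y_min, x_max, y_max)
new_pos = np.tile(base_pos, (num_envs, 1))
new_pos[:, :2] += rng.uniform([pr[0], pr[1]], [pr[2], pr[3]], (num_envs, 2))

old_dist = np.linalg.norm(old_xy - target_xy, axis=1)
new_dist = np.linalg.norm(new_pos[:, :2] - target_xy, axis=1)
old_margin = old_boundary_radius - old_dist  # distance left before the old boundary

print(f"old: mean distance {old_dist.mean():.2f} m, min boundary margin {old_margin.min():.2f} m")
print(f"new: mean distance {new_dist.mean():.2f} m, spawn height {new_pos[0, 2]:.2f} m")
```

Under these assumptions the old scheme leaves each robot roughly 3 m from the target with at most about 0.6 m of margin to the old 3.5 m boundary, while the new scheme starts it well under 1 m from the target.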

2. Base contact termination

Before: Threshold 0.01 with no grace period → instant termination on landing
After: Threshold 0.1 with 50-step grace period

GRACE_STEPS = 50
current_steps = state.info.get("steps", np.zeros(num_envs, dtype=np.int32))
past_grace = current_steps > GRACE_STEPS
base_contact = (base_contact_value > 0.1).flatten()[:num_envs]  # was 0.01
terminated = np.logical_or(terminated, base_contact & past_grace)

Added steps increment in update_state(): state.info["steps"] = state.info["steps"] + 1
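
The grace-period logic can be exercised in isolation. The sketch below wraps it in a hypothetical helper, `base_contact_terminated` (not a function from the repository), and feeds it hand-picked contact values and step counts to show which environments would terminate.

```python
import numpy as np

def base_contact_terminated(contact_value, steps, threshold=0.1, grace_steps=50):
    """Return a per-env bool mask: contact above threshold AND grace period over."""
    contact_value = np.asarray(contact_value, dtype=np.float32).reshape(-1)
    steps = np.asarray(steps, dtype=np.int32).reshape(-1)
    past_grace = steps > grace_steps          # strictly greater, as in the PR snippet
    return (contact_value > threshold) & past_grace

contact = np.array([0.5, 0.5, 0.05, 0.5])
steps = np.array([3, 50, 80, 51])
flags = base_contact_terminated(contact, steps)
print(flags.tolist())
```

Only the last environment terminates: the first two are still inside the grace period (step 50 is not strictly greater than 50), and the third is below the 0.1 threshold.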

3. Orientation penalty

Before: Hard-coded -10.0 on landing (>45° tilt)
After: Progressive penalty capped at -3.0

# Old: extreme spike
reward = np.where(orientation_penalty > 0.5, reward - 10.0, reward)

# New: smooth gradient
extreme_tilt_penalty = np.clip((orientation_penalty - 0.5) * 5.0, 0.0, 3.0)
reward = reward - extreme_tilt_penalty
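
As a quick check of how the two formulas differ, the following sketch evaluates both on a few sample orientation_penalty values. The helper names are illustrative only.

```python
import numpy as np

def old_tilt_penalty(orientation_penalty):
    # Step function: a flat 10.0 as soon as the threshold is crossed
    return np.where(orientation_penalty > 0.5, 10.0, 0.0)

def new_tilt_penalty(orientation_penalty):
    # Linear ramp starting at 0.5, capped at 3.0
    return np.clip((orientation_penalty - 0.5) * 5.0, 0.0, 3.0)

samples = np.array([0.25, 0.5, 0.75, 1.0, 1.5])
old = old_tilt_penalty(samples)
new = new_tilt_penalty(samples)
for p, o, n in zip(samples, old, new):
    print(f"orientation_penalty={p:.2f}  old={o:.2f}  new={n:.2f}")
```

The new penalty grows continuously from zero at the threshold and reaches its 3.0 cap once orientation_penalty is 1.1 or more, instead of jumping straight to 10.0.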

4. Info dict initialization

Added missing keys in reset():

  • "last_distance": distance_to_target.copy() - for progress reward calculation
  • "steps": np.zeros(num_envs, dtype=np.int32) - for grace period tracking

Testing

test_immediate_termination_fix.py validates all four fixes independently and in integration.

Original prompt

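Problem description

VBot training shows an "immediate termination" problem: robots die right after spawning. When running uv run scripts/train.py --env vbot_navigation_section001, the robot "vanishes in mid-air" in the video and the rewards are extremely low:

  • Total reward (max) ≈ 1.7-1.9 (a healthy run should be far higher)
  • Total reward (mean) ≈ 0.29 (almost zero)
  • Total reward (min) quickly converges from -2.5 to -0.3

As shown in the TensorBoard reward curves:

[image1: TensorBoard reward curves]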

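Root cause analysis

Code review found 4 critical bugs:

Bug 1: Polar-coordinate random generation in reset() overrides the close-range curriculum start

File: motrix_envs/src/motrix_envs/navigation/vbot/vbot_section001_np.py (L815-L828)

reset() uses arena_outer_radius = 3.0 to generate random initial positions in polar coordinates on the outer ring, completely ignoring cfg.init_state.pos = [0.0, 0.6, 0.5] (the close-range start for the first curriculum stage). Meanwhile target_point_a = [0.0, 0.0] puts the target at the origin, while the robot spawns on a circle of radius 3 m with boundary_radius = 3.5, so the initial position is almost on the boundary.

Fix: reset() should use cfg.init_state.pos as the base position plus a small XY randomization (pos_randomization_range), rather than spawning on the outer ring in polar coordinates. Keep the polar logic, but only as an optional mode.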

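Bug 2: The base_contact sensor threshold is too low (0.01) and triggers termination the instant the robot lands

File: motrix_envs/src/motrix_envs/navigation/vbot/vbot_section001_np.py (L508-L520)

The robot spawns at a height of 0.5 m and falls. At the moment of landing, base_contact_value > 0.01 immediately becomes True and the episode ends at once. There is no grace period.

Fix:

  1. Raise the base_contact threshold from 0.01 to 0.1
  2. Add a grace period: skip the base_contact termination check for the first 50 steps (about 0.5 s)
  3. Ensure the info dict returned by reset() contains a "steps" key (initialized to np.zeros(num_envs, dtype=np.int32))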

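Bug 3: A hard-coded -10.0 extreme penalty in the reward function fires on the initial frames

File: motrix_envs/src/motrix_envs/navigation/vbot/vbot_section001_np.py (L800-L803)

reward = np.where(orientation_penalty > 0.5, reward - 10.0, reward)

orientation_penalty = sum(gravity_xy^2) > 0.5 corresponds to a tilt of roughly 45°. A robot in free fall from 0.5 m can easily trigger this at the moment of landing, which cuts that step's reward by 10.

Fix: replace the hard-coded -10.0 with a progressive penalty: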

extreme_tilt_penalty = np.clip((orientation_penalty - 0.5) * 5.0, 0.0, 3.0)
reward = reward - extreme_tilt_penalty
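
The "roughly 45°" figure can be checked with a short calculation. The sketch below assumes the projected gravity is a unit vector in the body frame, so that the sum of its squared x and y components equals sin² of the tilt angle; the helper names are illustrative, not taken from the repository.

```python
import math

def orientation_penalty_from_tilt(tilt_deg):
    # For a unit gravity vector tilted by angle t: gx^2 + gy^2 = sin(t)^2
    t = math.radians(tilt_deg)
    return math.sin(t) ** 2

def progressive_penalty(orientation_penalty):
    # Scalar version of np.clip((orientation_penalty - 0.5) * 5.0, 0.0, 3.0)
    return min(max((orientation_penalty - 0.5) * 5.0, 0.0), 3.0)

for tilt in (30, 45, 60, 90):
    p = orientation_penalty_from_tilt(tilt)
    print(f"tilt={tilt:>2} deg  orientation_penalty={p:.3f}  penalty={progressive_penalty(p):.3f}")
```

Under the unit-vector assumption, orientation_penalty crosses 0.5 at exactly 45° and cannot exceed 1.0, so the progressive penalty tops out near 2.5 at a 90° tilt and the 3.0 cap acts only as a safety bound.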

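Bug 4: The info dict returned by reset() is missing "last_distance" and "steps"

File: motrix_envs/src/motrix_envs/navigation/vbot/vbot_section001_np.py (L1037-L1044)

_compute_reward() uses info.get("last_distance", distance_to_target) to compute progress. On the first step the fallback is the current distance itself (progress = 0), but if the robot falls away from the target on the first step, progress is negative. More importantly, info lacks the "steps" key, so the grace-period mechanism cannot work.

Fix: add the following to the info dict in reset():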

"last_distance": distance_to_target.copy(),
"steps": np.zeros(num_envs, dtype=np.int32),
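
To illustrate why the missing keys matter, here is a small sketch of the `.get` fallbacks described above. The helpers `progress_reward` and `past_grace` are hypothetical, and the sign convention (progress = last distance minus current distance) is an assumption consistent with the description, not confirmed against the repository.

```python
import numpy as np

def progress_reward(info, distance_to_target):
    # Falls back to the current distance, so progress is 0 when the key is missing
    last = info.get("last_distance", distance_to_target)
    return last - distance_to_target

def past_grace(info, num_envs, grace_steps=50):
    # Falls back to zeros, so the grace period never appears to end
    steps = info.get("steps", np.zeros(num_envs, dtype=np.int32))
    return steps > grace_steps

distance = np.array([1.0, 2.0])

empty_info = {}  # what reset() effectively returned before the fix
full_info = {
    "last_distance": np.array([1.2, 2.5]),
    "steps": np.full(2, 60, dtype=np.int32),
}

print(progress_reward(empty_info, distance), past_grace(empty_info, 2))
print(progress_reward(full_info, distance), past_grace(full_info, 2))
```

With the empty dict, progress is always zero and the grace check never passes; with both keys initialized and updated each step, progress reflects the actual change in distance and the grace period can expire.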

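Files to modify

1. motrix_envs/src/motrix_envs/navigation/vbot/vbot_section001_np.py

Modify the reset() method (around L811-L890):

  • Replace the polar-coordinate outer-ring random generation with initialization based on cfg.init_state.pos + pos_randomization_range
  • Use cfg.init_state.pos[2] (or lower it slightly to 0.35 m) as the initial height to reduce the free-fall impact
  • Add "last_distance" and "steps" to the returned info dict

Implementation: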

def reset(self, data: mtx.SceneData, done: np.ndarray = None) -> tuple[np.ndarray, dict]:
    cfg: VBotSection001EnvCfg = self._cfg
    num_envs = data.shape[0]

    # Use cfg.init_state.pos as the base position
    base_pos = np.array(cfg.init_state.pos, dtype=np.float32)
    robot_init_pos = np.tile(base_pos, (num_envs, 1))

    # Small XY randomization
    if hasattr(cfg.init_state, 'pos_randomization_range'):
        pr = cfg.init_state.pos_randomization_range
        xy_noise = np.random.uniform(
            [pr[0], pr[1]], [pr[2], pr[3]], (num_envs, 2)
        ).astype(np.float32)
        robot_init_pos[:, :2] += xy_noise

    # Lower the initial height to reduce the landing impact (0.5 → 0.35)
    robot_init_pos[:, 2] = 0.35

    dof_pos = np.tile(self._init_dof_pos, (num_envs, 1))
    dof_vel = np.tile(self._init_dof_vel, (num_envs, 1))
    # ... domain randomization logic stays unchanged ...

Add to the info dict (around L1037):

info = {
    "pose_commands": pose_commands,
    "last_actions": np.zeros((num_envs, self._num_action), dtype=np.float32),
    "current_actions": np.zeros((num_envs, self._num_action), dtype=np.float32),
    "filtered_actions": np.zeros((num_envs, self._num_action), dtype=np.float32),
    "ever_reached": np.zeros(num_envs, dtype=bool),
    "min_distance": distance_to_target.copy(),
    "last_distance": distance_to_target.copy(),   # ✅ added
    "steps": np.zeros(num_envs, dtype=np.int32),   # ✅ added
}

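Modify the _compute_terminated() method (around L480-L522):

  • Raise the base_contact threshold from 0.01 to 0.1
  • Add a grace period: skip the base_contact check for the first 50 steps
  • Read steps from state.info and check whether the grace period has passed

def _compute_terminated(self, state: NpEnvState) -> NpEnvState: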
    data = state.data
    num_envs = self._num_envs
    terminated = np.zeros(num_envs, dtype=bool)
    
    # 1. Timeout
    if self._cfg.max_episode_steps:
        timeout = state.info["steps"] >= self._cfg.max_episode_steps
        terminated = np.logical_or(terminated, timeout)
    
    # 2. Extreme tilt (unchanged)
    root_pos, root_quat, root_vel = self._extract_root_state(data)
    recovery_tilt_threshold = getattr(self._cfg, 'recovery_tilt_threshold', 80.0)
    tilt_threshold_rad = np.deg2rad(recovery_tilt_threshold)
    gravity = self._compute_projected_gravity(root_quat)
    tilt_angle = np.arccos(np.clip(gravity[:, 2], -1.0, 1.0))
    extreme_tilt = tilt_angle > tilt_threshold_rad
    terminated = np.logical_or(terminated, extreme_tilt)
    
    # 3. Base contact: add grace period and raise threshold
    GRACE_STEPS = 50
    current_steps = state.i...

</details>



*This pull request was created from Copilot chat.*
Copilot AI and others added 23 commits February 11, 2026 11:56
- Add target_point_a (0.0, 1.5) and target_point_b (0.0, 0.0) to VBotSection001EnvCfg
- Modify reset() to use fixed unified target positions instead of random offsets
- All robots now navigate to the same target point (inner circle trigger point)

Co-authored-by: Logic-TARS <99871356+Logic-TARS@users.noreply.github.com>
Fix VBot navigation to use unified target positions for arena scoring
…tness

Co-authored-by: Logic-TARS <99871356+Logic-TARS@users.noreply.github.com>
…zation

Improve VBot navigation robustness through domain randomization and reward reweighting
…ments, forced exploration

Co-authored-by: Logic-TARS <99871356+Logic-TARS@users.noreply.github.com>
…ation

Co-authored-by: Logic-TARS <99871356+Logic-TARS@users.noreply.github.com>
…weights

Fix zero-velocity local optimum in VBot navigation via reward reshaping and recovery threshold
Copilot AI and others added 4 commits February 12, 2026 02:04
Co-authored-by: Logic-TARS <99871356+Logic-TARS@users.noreply.github.com>
Copilot AI and others added 3 commits February 12, 2026 02:11
Co-authored-by: Logic-TARS <99871356+Logic-TARS@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix immediate termination issue in VBot training" to "Fix VBot immediate termination on spawn" Feb 12, 2026
Copilot AI requested a review from Logic-TARS February 12, 2026 02:14
@Logic-TARS
Owner

@copilot git commit and push to remote

Contributor Author

Copilot AI commented Feb 22, 2026

@copilot git commit and push to remote

The branch is already up to date with the remote. All changes from the PR have been committed and pushed:

  • Latest commit: d213ae6 "Fix test logic bug and improve reset() comment clarity"
  • Branch status: copilot/fix-immediate-termination-bug is synchronized with origin/copilot/fix-immediate-termination-bug
  • Working tree: clean (no uncommitted changes)

The PR includes 7 commits with all 4 bug fixes, tests, and documentation. No further commits or pushes are needed.
