Fix VBot immediate termination on spawn by Copilot · Pull Request #6 · Logic-TARS/VbotCompetition

Copilot · 2026-02-12T02:00:18Z

VBot robots were terminating within 1-5 steps of spawning (mean reward ~0.29 vs. expected >2.0) due to four interrelated bugs in the reset, termination, and reward logic.

Changes

1. Reset position generation

Before: Polar coordinate generation on outer circle (radius 3.0m) ignored cfg.init_state.pos
After: Use curriculum learning position with XY randomization only

# Old: spawn on outer circle, far from target
for i in range(num_envs):
    theta = np.random.uniform(0, 2 * np.pi)
    radius = cfg.arena_outer_radius + np.random.uniform(-0.1, 0.1)
    robot_init_xy[i] = [radius * np.cos(theta), radius * np.sin(theta)]

# New: spawn at curriculum position with small noise
base_pos = np.array(cfg.init_state.pos, dtype=np.float32)  # [0.0, 0.6, 0.35]
robot_init_pos = np.tile(base_pos, (num_envs, 1))
pr = cfg.init_state.pos_randomization_range  # [x_min, y_min, x_max, y_max]
robot_init_pos[:, :2] += np.random.uniform([pr[0], pr[1]], [pr[2], pr[3]], (num_envs, 2))

Also: init_state.pos[2] 0.5→0.35m to reduce fall impact, boundary_radius 3.5→5.0m
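
The new spawn logic can be tried outside the environment with a minimal, self-contained sketch. The helper name `sample_spawn_positions` and the ±0.1 m randomization range are illustrative assumptions, not values taken from the repository; only the base position `[0.0, 0.6, 0.35]` comes from this PR.

```python
import numpy as np

def sample_spawn_positions(base_pos, pos_range, num_envs, rng):
    """Tile base_pos per environment and add uniform XY noise; z is untouched."""
    x_min, y_min, x_max, y_max = pos_range
    pos = np.tile(np.asarray(base_pos, dtype=np.float32), (num_envs, 1))
    noise = rng.uniform([x_min, y_min], [x_max, y_max], size=(num_envs, 2))
    pos[:, :2] += noise.astype(np.float32)
    return pos

rng = np.random.default_rng(0)
# Hypothetical range: +/-0.1 m in both x and y.
spawn = sample_spawn_positions([0.0, 0.6, 0.35], (-0.1, -0.1, 0.1, 0.1), 4, rng)
# Distance to a target at the origin stays close to the 0.6 m curriculum start.
dist_to_target = np.linalg.norm(spawn[:, :2], axis=1)
```

With this range every robot starts between roughly 0.5 m and 0.71 m from a target at the origin, instead of about 3 m away on the outer circle.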

2. Base contact termination

Before: Threshold 0.01 with no grace period → instant termination on landing
After: Threshold 0.1 with 50-step grace period

GRACE_STEPS = 50
current_steps = state.info.get("steps", np.zeros(num_envs, dtype=np.int32))
past_grace = current_steps > GRACE_STEPS
base_contact = (base_contact_value > 0.1).flatten()[:num_envs]  # was 0.01
terminated = np.logical_or(terminated, base_contact & past_grace)

Added steps increment in update_state(): state.info["steps"] = state.info["steps"] + 1
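
As a rough illustration of how the threshold and the grace period combine, the sketch below applies both to a small batch. The function name `base_contact_termination` and the sample sensor values are made up for the example; the constants are the ones stated in this PR.

```python
import numpy as np

GRACE_STEPS = 50          # from this PR
CONTACT_THRESHOLD = 0.1   # from this PR (previously 0.01)

def base_contact_termination(contact_value, steps):
    """True only where the base touches the ground after the grace period."""
    past_grace = steps > GRACE_STEPS
    return (contact_value > CONTACT_THRESHOLD) & past_grace

steps = np.array([3, 50, 51, 200], dtype=np.int32)
contact = np.array([0.8, 0.8, 0.8, 0.05], dtype=np.float32)
mask = base_contact_termination(contact, steps)
```

Because the comparison is strict (`>`), step 50 itself still falls inside the grace period; only the third environment (step 51, contact 0.8) terminates here.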

3. Orientation penalty

Before: Hard-coded -10.0 on landing (>45° tilt)
After: Progressive penalty capped at -3.0

# Old: extreme spike
reward = np.where(orientation_penalty > 0.5, reward - 10.0, reward)

# New: smooth gradient
extreme_tilt_penalty = np.clip((orientation_penalty - 0.5) * 5.0, 0.0, 3.0)
reward = reward - extreme_tilt_penalty
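
To compare the two penalty shapes side by side, here is a small sketch that evaluates both over a few sample values. The function names are illustrative; the formulas are the ones shown in this section.

```python
import numpy as np

def old_penalty(orientation_penalty):
    """Step function: a flat 10.0 as soon as the threshold is crossed."""
    return np.where(orientation_penalty > 0.5, 10.0, 0.0)

def new_penalty(orientation_penalty):
    """Linear ramp above 0.5 with slope 5.0, clipped to [0.0, 3.0]."""
    return np.clip((orientation_penalty - 0.5) * 5.0, 0.0, 3.0)

op = np.array([0.4, 0.5, 0.6, 0.8, 1.0])
old_values = old_penalty(op)
new_values = new_penalty(op)
```

If `orientation_penalty` is the squared XY norm of a unit gravity vector (an assumption based on the description in the original prompt), it cannot exceed 1.0, so the ramp tops out at 2.5 and the 3.0 cap acts only as a safety bound.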

4. Info dict initialization

Added missing keys in reset():

"last_distance": distance_to_target.copy() - for progress reward calculation
"steps": np.zeros(num_envs, dtype=np.int32) - for grace period tracking

Testing

test_immediate_termination_fix.py validates all four fixes independently and in integration.

Original prompt

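Problem description

VBot training shows an "immediate termination" problem: robots die right after spawning. When running uv run scripts/train.py --env vbot_navigation_section001, the robot "vanishes in mid-air" in the video and rewards are extremely low:

Total reward (max) ≈ 1.7-1.9 (should normally be far higher)
Total reward (mean) ≈ 0.29 (close to zero)
Total reward (min) converges quickly from -2.5 to -0.3

As shown in the figure (TensorBoard reward curves):

Root cause analysis

Code review found four key bugs: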

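Bug 1: polar-coordinate random generation in `reset()` overrides the curriculum's close-range start position

File: motrix_envs/src/motrix_envs/navigation/vbot/vbot_section001_np.py (L815-L828)

reset() uses arena_outer_radius = 3.0 to randomly generate initial positions in polar coordinates on the outer circle, completely ignoring cfg.init_state.pos = [0.0, 0.6, 0.5] (the close-range start for the first curriculum stage). Meanwhile target_point_a = [0.0, 0.0], so the target sits at the origin while the robot spawns on a circle of radius 3 m with boundary_radius = 3.5, which puts the initial position almost on the boundary.

Fix: reset() should use cfg.init_state.pos as the base position plus small-range XY randomization (pos_randomization_range) instead of generating positions on the outer circle in polar coordinates. Keep the polar-coordinate logic, but as an optional mode.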
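
The "almost on the boundary" claim can be checked with a few lines of arithmetic. The sketch below uses the radii quoted in this PR and the ±0.1 m radial jitter from the old spawn code; the variable names are illustrative.

```python
import math

spawn_radius = 3.0         # arena_outer_radius in the old reset()
radial_jitter = 0.1        # np.random.uniform(-0.1, 0.1) in the old reset()
boundary_radius_old = 3.5
boundary_radius_new = 5.0

# Old behavior: distance left before the robot crosses the boundary.
margin_min = boundary_radius_old - (spawn_radius + radial_jitter)
margin_max = boundary_radius_old - (spawn_radius - radial_jitter)

# New behavior: spawn near the curriculum start, target at the origin.
curriculum_xy = (0.0, 0.6)
curriculum_dist = math.hypot(*curriculum_xy)
margin_new = boundary_radius_new - curriculum_dist
```

Under the old settings the robot had only about 0.4 to 0.6 m of slack before leaving the arena, while it needed to cover about 3 m to reach the target. With the new settings it starts 0.6 m from the target with roughly 4.4 m of slack (before XY noise).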

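Bug 2: the `base_contact` sensor threshold is too low (0.01) and triggers termination the instant the robot lands

File: motrix_envs/src/motrix_envs/navigation/vbot/vbot_section001_np.py (L508-L520)

The robot spawns at a height of 0.5 m and then falls. At the moment of landing, base_contact_value > 0.01 is immediately True and the episode ends at once. There is no grace period.

Fix:

Raise the base_contact threshold from 0.01 → 0.1
Add a grace period: skip the base_contact termination check for the first 50 steps (about 0.5 seconds)
Make sure the "steps" key exists in the info dict returned by reset() (initialized to np.zeros(num_envs, dtype=np.int32))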
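
A back-of-the-envelope check suggests that a 50-step grace period comfortably covers the landing. The sketch below treats the drop as ideal free fall over the full spawn height, which is an upper bound (the feet touch down before the base has fallen that far), and assumes a 0.01 s step, inferred from "50 steps ≈ 0.5 seconds". The function names are illustrative.

```python
import math

G = 9.81    # gravitational acceleration, m/s^2
DT = 0.01   # assumed step duration, inferred from "50 steps ~ 0.5 s"

def free_fall(height_m):
    """Return (fall time in s, impact speed in m/s) for an ideal drop."""
    t = math.sqrt(2.0 * height_m / G)
    return t, G * t

def steps_to_land(height_m):
    """Number of control steps needed to cover the ideal fall time."""
    return math.ceil(free_fall(height_m)[0] / DT)

t_old, v_old = free_fall(0.5)    # spawn height before this PR
t_new, v_new = free_fall(0.35)   # spawn height after this PR
```

Under these assumptions the fall takes roughly 0.32 s from 0.5 m and 0.27 s from 0.35 m, so both landings finish well inside the 50-step window, and the lower spawn height cuts the ideal impact energy by about 30%.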

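Bug 3: the hard-coded `-10.0` extreme penalty in the reward function triggers on the initial frames

File: motrix_envs/src/motrix_envs/navigation/vbot/vbot_section001_np.py (L800-L803)

reward = np.where(orientation_penalty > 0.5, reward - 10.0, reward)

orientation_penalty = sum(gravity_xy^2) > 0.5 corresponds to a tilt of roughly 45°. The robot easily exceeds this at the moment it lands after free fall from 0.5 m, so the reward for that step immediately drops by 10.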
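
The 45° figure follows from simple trigonometry if the projected gravity vector has unit length (an assumption): for a tilt angle θ, the XY components have magnitude sin θ, so the penalty equals sin²θ. A minimal sketch with illustrative function names:

```python
import math

def penalty_for_tilt(tilt_deg):
    """Squared XY norm of a unit gravity vector tilted by tilt_deg degrees."""
    return math.sin(math.radians(tilt_deg)) ** 2

def tilt_for_penalty(penalty):
    """Inverse mapping: tilt angle in degrees that produces the given penalty."""
    return math.degrees(math.asin(math.sqrt(penalty)))

threshold_tilt = tilt_for_penalty(0.5)
```

Here `threshold_tilt` evaluates to 45° up to floating-point error, while a 30° tilt gives a penalty of 0.25 and a 60° tilt gives 0.75.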

Fix: replace the hard-coded -10.0 with a progressive penalty:

extreme_tilt_penalty = np.clip((orientation_penalty - 0.5) * 5.0, 0.0, 3.0)
reward = reward - extreme_tilt_penalty

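Bug 4: the `info` returned by `reset()` lacks the `"last_distance"` and `"steps"` keys

File: motrix_envs/src/motrix_envs/navigation/vbot/vbot_section001_np.py (L1037-L1044)

_compute_reward() uses info.get("last_distance", distance_to_target) to compute progress. Although the first-step fallback is the current distance itself (progress=0), progress becomes negative if the robot falls away from the target on the first step. More importantly, info lacks the "steps" key, so the grace period mechanism cannot work.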
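
The effect of the two missing keys can be reproduced with plain dictionaries. In the sketch below, the helper names, the sample distances, and the sign convention for progress (previous distance minus current distance) are assumptions made for illustration; only the `.get(...)` fallbacks mirror the code described above.

```python
import numpy as np

def progress(info, distance_now):
    """Progress toward the target; falls back to zero progress if the key is missing."""
    last = info.get("last_distance", distance_now)
    return last - distance_now

def past_grace(info, num_envs, grace_steps=50):
    """Grace check; a missing "steps" key reads as step 0 on every call."""
    steps = info.get("steps", np.zeros(num_envs, dtype=np.int32))
    return steps > grace_steps

reset_distance = np.array([0.6, 0.6])
after_one_step = np.array([0.55, 0.70])

info_missing = {}
info_fixed = {
    "last_distance": reset_distance.copy(),
    "steps": np.zeros(2, dtype=np.int32),
}

progress_missing = progress(info_missing, after_one_step)
progress_fixed = progress(info_fixed, after_one_step)

info_fixed["steps"] = info_fixed["steps"] + 60   # simulate 60 elapsed steps
grace_missing = past_grace(info_missing, 2)
grace_fixed = past_grace(info_fixed, 2)
```

Without the keys, the first step reports zero progress for both robots and the grace check never becomes true as long as nothing else writes "steps". With the keys initialized in reset(), the first step yields progress of about +0.05 and -0.10, and the grace period ends once the counter passes 50.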

Fix: add the following to the info dict in reset():

"last_distance": distance_to_target.copy(),
"steps": np.zeros(num_envs, dtype=np.int32),

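Files to modify

1. `motrix_envs/src/motrix_envs/navigation/vbot/vbot_section001_np.py`

Modify the `reset()` method (around L811-L890):

Replace the polar-coordinate outer-circle random generation with initialization based on cfg.init_state.pos + pos_randomization_range
Use cfg.init_state.pos[2] (or lower it slightly to 0.35 m) as the initial height to reduce the free-fall impact
Add the "last_distance" and "steps" keys to the returned info dict

Implementation: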

def reset(self, data: mtx.SceneData, done: np.ndarray = None) -> tuple[np.ndarray, dict]:
    cfg: VBotSection001EnvCfg = self._cfg
    num_envs = data.shape[0]

    # Use cfg.init_state.pos as the base position
    base_pos = np.array(cfg.init_state.pos, dtype=np.float32)
    robot_init_pos = np.tile(base_pos, (num_envs, 1))

    # Small-range XY randomization
    if hasattr(cfg.init_state, 'pos_randomization_range'):
        pr = cfg.init_state.pos_randomization_range
        xy_noise = np.random.uniform(
            [pr[0], pr[1]], [pr[2], pr[3]], (num_envs, 2)
        ).astype(np.float32)
        robot_init_pos[:, :2] += xy_noise

    # Lower the initial height to reduce the drop impact (0.5 → 0.35)
    robot_init_pos[:, 2] = 0.35

    dof_pos = np.tile(self._init_dof_pos, (num_envs, 1))
    dof_vel = np.tile(self._init_dof_vel, (num_envs, 1))
    # ... domain randomization logic stays unchanged ...

Add to the info dict (around L1037):

info = {
    "pose_commands": pose_commands,
    "last_actions": np.zeros((num_envs, self._num_action), dtype=np.float32),
    "current_actions": np.zeros((num_envs, self._num_action), dtype=np.float32),
    "filtered_actions": np.zeros((num_envs, self._num_action), dtype=np.float32),
    "ever_reached": np.zeros(num_envs, dtype=bool),
    "min_distance": distance_to_target.copy(),
    "last_distance": distance_to_target.copy(),   # ✅ added
    "steps": np.zeros(num_envs, dtype=np.int32),   # ✅ added
}

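Modify the `_compute_terminated()` method (around L480-L522):

Raise the base_contact threshold from 0.01 → 0.1
Add a grace period: skip the base_contact check for the first 50 steps
Read steps from state.info and check whether the grace period has passed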

def _compute_terminated(self, state: NpEnvState) -> NpEnvState:
    data = state.data
    num_envs = self._num_envs
    terminated = np.zeros(num_envs, dtype=bool)
    
    # 1. Timeout
    if self._cfg.max_episode_steps:
        timeout = state.info["steps"] >= self._cfg.max_episode_steps
        terminated = np.logical_or(terminated, timeout)
    
    # 2. Extreme tilt (unchanged)
    root_pos, root_quat, root_vel = self._extract_root_state(data)
    recovery_tilt_threshold = getattr(self._cfg, 'recovery_tilt_threshold', 80.0)
    tilt_threshold_rad = np.deg2rad(recovery_tilt_threshold)
    gravity = self._compute_projected_gravity(root_quat)
    tilt_angle = np.arccos(np.clip(gravity[:, 2], -1.0, 1.0))
    extreme_tilt = tilt_angle > tilt_threshold_rad
    terminated = np.logical_or(terminated, extreme_tilt)

    # 3. Base contact: add grace period + raise threshold
    GRACE_STEPS = 50
    current_steps = state.i...

</details>



*This pull request was created from Copilot chat.*
