
Fix zero-velocity local optimum in VBot navigation via reward reshaping and recovery threshold#5

Merged
Logic-TARS merged 8 commits into main from copilot/adjust-reward-function-weights
Feb 11, 2026
Conversation

Contributor

Copilot AI commented Feb 11, 2026

Problem

VBot agents converge to a stationary (zero-velocity) policy to minimize falling penalties, which prevents goal-reaching behavior. The reward structure creates a local optimum in which risk avoidance dominates exploration.

Solution

1. Reward Function Restructuring

Inverted the penalty-to-reward ratio to achieve 17.5:1 positive dominance:

# cfg.py - RewardConfig
scales = {
    # Strong positive incentives
    "forward_velocity": 2.0,           # NEW: was 0.5 (4x increase)
    "position_tracking": 1.5,
    
    # Minimal penalties (3-4x reduction)
    "orientation": -0.05,              # was -0.20
    "lin_vel_z": -0.10,                # was -0.30
    "ang_vel_xy": -0.05,               # was -0.15
    
    # Gait formation incentives
    "foot_air_time": 0.3,              # was 0.1
    "action_smoothness": 0.1,          # was -0.01 (flipped)
    "contact_stability": 0.2,          # was 0.1
    
    # Near-zero regularization
    "torques": -0.00001,               # unchanged at -1e-5 (near zero)
    "action_rate": -0.001,             # was -0.01
}

Added forward velocity reward computation:

# vbot_section001_np.py - _compute_reward()
forward_direction = position_error / (distance_to_target[:, np.newaxis] + 1e-6)
forward_velocity = np.sum(base_lin_vel[:, :2] * forward_direction, axis=1)
forward_velocity_reward = np.clip(forward_velocity, 0.0, 2.0)
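
The projection above can be exercised standalone with dummy arrays (a sketch: the array names and shapes are assumptions based on the snippet; the real values come from the environment state):

```python
import numpy as np

# Dummy stand-ins for environment state (1 env, XY plane).
position_error = np.array([[3.0, 4.0]])                       # target minus base position, XY
distance_to_target = np.linalg.norm(position_error, axis=1)   # [5.0]
base_lin_vel = np.array([[0.6, 0.8, 0.0]])                    # base linear velocity, XYZ

# Unit vector toward the target; the epsilon avoids division by zero at the goal.
forward_direction = position_error / (distance_to_target[:, np.newaxis] + 1e-6)
# Velocity component along that direction; clipping means only progress is rewarded.
forward_velocity = np.sum(base_lin_vel[:, :2] * forward_direction, axis=1)
forward_velocity_reward = np.clip(forward_velocity, 0.0, 2.0)
print(forward_velocity_reward)  # ≈ [1.0], since (0.6, 0.8) is exactly the goal direction
```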

Math: total positive potential = 2.0 + 1.5 = 3.5; total meaningful penalties ≈ 0.05 + 0.10 + 0.05 = 0.2 → a 17.5:1 ratio that encourages exploration.
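
The ratio arithmetic can be checked in isolation (the dict literal below just mirrors the dominant scale values listed above, not the real RewardConfig object; the near-zero regularizers are omitted as negligible):

```python
# Sanity check of the 17.5:1 claim using the dominant scales from the table above.
scales = {
    "forward_velocity": 2.0, "position_tracking": 1.5,
    "orientation": -0.05, "lin_vel_z": -0.10, "ang_vel_xy": -0.05,
}
positive = sum(v for v in scales.values() if v > 0)    # 3.5
penalties = -sum(v for v in scales.values() if v < 0)  # 0.2
print(positive, penalties, positive / penalties)       # 3.5 0.2 17.5
```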

2. Recovery-Enabled Termination

Rewrote _compute_terminated() to allow self-correction:

  • Tilt threshold raised from 60° to 80° (configurable via recovery_tilt_threshold)
  • ~33% more tilt tolerance before termination, leaving room to learn balance recovery
  • Maintains base contact detection for safety

3. Initial State Diversification

Modified reset() to force ~1/3 of environments to start with random velocity (±0.3 m/s XY), breaking the zero-velocity attractor basin.

4. Configuration Parameters

Added to VBotSection001EnvCfg:

  • force_initial_motion: bool = True
  • recovery_tilt_threshold: float = 80.0

Expected Impact

Metric            Before      Expected
Avg velocity      ~0.0 m/s    >0.5 m/s
Success rate      0%          30-50%
Tilt tolerance    60°         80°

Files Changed

  • cfg.py: Reward hierarchy + parameters (43 lines)
  • vbot_section001_np.py: Termination, reward computation, reset logic (155 lines)
  • Validation scripts + documentation (497 lines)
Original prompt

Problem description

The VBot navigation model is stuck in a zero-velocity local optimum, which severely blocks learning progress:

🔴 Current symptoms

  • The agent has learned to avoid fall penalties by standing still
  • The policy is extremely risk-averse (all joints near zero velocity)
  • It cannot discover a "move without falling" strategy
  • It rarely falls anymore, but it also never earns goal-related velocity rewards

🎯 Root causes

  1. Penalty weights too large: fall penalty >> movement reward
  2. Insufficient exploration: too few successful movement samples to guide learning
  3. Flawed reward structure: no intermediate rewards for gait formation
  4. Overly strict termination: any tilt ends the episode, amplifying the cost of failure

Solution (four progressive layers)

🟢 Layer 1: Reshape the reward function (CRITICAL)

1.1 Reward weight adjustments

Modify RewardConfig in cfg.py:

@dataclass
class RewardConfig:
    scales: dict[str, float] = field(
        default_factory=lambda: {
            # ===== Tier 1: strong positive rewards (drive movement) =====
            "forward_velocity": 2.0,              # NEW: forward-velocity reward (primary driver)
            "position_tracking": 1.5,             # goal tracking (kept)

            # ===== Tier 2: weak penalties (do not forbid exploration) =====
            # Key change: sharply reduce penalty weights so positive rewards dominate
            "orientation": -0.05,                 # ⬇️ from -0.20 (4x reduction)
            "lin_vel_z": -0.10,                   # ⬇️ from -0.30 (3x reduction)
            "ang_vel_xy": -0.05,                  # ⬇️ from -0.15 (3x reduction)

            # ===== Tier 3: fine-grained rewards (shape the gait) =====
            "foot_air_time": 0.3,                 # encourage regular leg swing
            "action_smoothness": 0.1,             # encourage smooth actions
            "contact_stability": 0.2,             # encourage foot contact

            # ===== Tier 4: near-zero penalties (almost no effect) =====
            "torques": -0.00001,                  # extremely low
            "action_rate": -0.001,                # extremely low
        }
    )

Key math:

Total reward = 2.0 * forward_velocity - 0.05 * orientation_penalty
Positive reward (2.0) > sum of all penalties (0.2) ✓
→ The agent has a strong incentive to try moving

1.2 New one-time terminal penalty (avoids continuous punishment)

# Add in _compute_reward():
terminal_penalty = np.where(
    state.terminated & ~timeout,
    -2.0,  # one-time penalty on falling
    0.0
)
reward += terminal_penalty
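
The one-shot penalty can be checked with dummy flags (a standalone sketch; the terminated/timeout arrays are stand-ins for the environment's episode-end flags):

```python
import numpy as np

# Stand-in episode-end flags for four environments.
terminated = np.array([True, True, False, False])
timeout = np.array([True, False, False, True])  # env 0 ended by timeout, not a fall

reward = np.zeros(4)
# Penalize only genuine falls: terminated AND not a timeout.
terminal_penalty = np.where(terminated & ~timeout, -2.0, 0.0)
reward += terminal_penalty
print(reward)  # [ 0. -2.  0.  0.] — only env 1 (a real fall) is penalized
```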

🟠 Layer 2: Improve termination conditions (IMPORTANT)

Modify the _compute_terminated() method in vbot_section001_np.py:

def _compute_terminated(self, state: NpEnvState) -> NpEnvState:
    """
    Improved termination: allow the robot to recover from mild tilts;
    only terminate in extreme situations.
    """
    data = state.data
    num_envs = self._num_envs
    terminated = np.zeros(num_envs, dtype=bool)
    
    # ===== 1. Timeout termination (kept; after 60 s) =====
    if self._cfg.max_episode_steps:
        timeout = state.info["steps"] >= self._cfg.max_episode_steps
        terminated = np.logical_or(terminated, timeout)
    
    # ===== 2. Improved fall detection (allows recovery) =====
    root_pos, root_quat, root_vel = self._extract_root_state(data)
    
    # Terminate only on "extreme" tilt (60° → 80°), giving the agent
    # a chance to recover from mild imbalance.
    # Note: this assumes _compute_projected_gravity returns a unit vector
    # whose z component is +1 when upright; negate it first if the helper
    # returns the downward gravity direction instead.
    gravity = self._compute_projected_gravity(root_quat)
    tilt_angle = np.arccos(np.clip(gravity[:, 2], -1.0, 1.0))  # angle from vertical
    extreme_tilt = tilt_angle > np.deg2rad(80)  # ⬆️ raised from 60° to 80°
    
    terminated = np.logical_or(terminated, extreme_tilt)
    
    # ===== 3. Base touching the ground (kept) =====
    try:
        cquerys = self._model.get_contact_query(data)
        termination_check = cquerys.is_colliding(self.termination_contact)
        base_contact = termination_check.reshape((num_envs, -1)).any(axis=1)
        terminated = np.logical_or(terminated, base_contact)
    except Exception as e:
        print(f"Warning: Could not check base contact: {e}")
    
    return state.replace(terminated=terminated)

What this change means

  • ❌ Before: mild tilt (>60°) → immediate termination → failure penalty
  • ✅ After: tilt ≤80° → recovery allowed → balance is learned
  • Effect: the agent gets a chance to self-correct and accumulate "did not fall" samples
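
The tilt test can be exercised in isolation (a sketch assuming the convention that the projected-gravity z component is +1 when the robot is upright; flip the sign if your helper returns the downward gravity direction):

```python
import numpy as np

# Projected-gravity z components for three robots: upright, 70° tilt, 85° tilt.
gravity_z = np.array([1.0, np.cos(np.deg2rad(70)), np.cos(np.deg2rad(85))])

tilt_angle = np.arccos(np.clip(gravity_z, -1.0, 1.0))  # angle from vertical
extreme_tilt = tilt_angle > np.deg2rad(80)             # new 80° threshold
print(extreme_tilt)  # [False False  True] — only the 85° robot terminates
```

Under the old 60° threshold the middle robot (70° tilt) would also have terminated; with 80° it survives and can attempt recovery.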

🔵 Layer 3: Force initial exploration

Add to the reset() method:

def reset(self, data: mtx.SceneData, done: np.ndarray = None):
    cfg: VBotSection001EnvCfg = self._cfg
    num_envs = data.shape[0]
    
    # ... existing initialization code ...
    
    dof_pos = np.tile(self._init_dof_pos, (num_envs, 1))
    dof_vel = np.tile(self._init_dof_vel, (num_envs, 1))
    
    # ===== New: force initial motion (break the zero-velocity trap) =====
    if hasattr(cfg, 'force_initial_motion') and cfg.force_initial_motion:
        # Force 1/3 of the environments to start moving (instead of at rest)
        num_moving = max(1, num_envs // 3)
        moving_indices = np.random.choice(num_envs, num_moving, replace=False)
        
        # Random initial forward velocity (a gentle push)
        initial_push = np.random.uniform(
            -0.3, 0.3,
            (num_moving, 2)  # XY velocity, ±0.3 m/s
        )
        dof_vel[moving_indices, 3:5] = initial_push
    
    # Write back to the physics engine
    data.set_dof_vel(dof_vel)
    data.set_dof_pos(dof_pos, self._model)
    self._model.forward_kinematic(data)
    
    # ... remaining code ...

Why this works

  • Some environments are forced to start from nonzero velocity
  • The agent must learn to "keep moving" rather than "stop"
  • The initial condition feeding the zero-velocity trap is broken
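
The selection logic can be sketched standalone (num_envs and the DOF count are arbitrary here, and the generator is seeded only to make the sketch reproducible):

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only for reproducibility of this sketch
num_envs = 8
dof_vel = np.zeros((num_envs, 12))  # 12 DOFs assumed purely for illustration

# Pick ~1/3 of the environments to receive a random XY push.
num_moving = max(1, num_envs // 3)
moving = rng.choice(num_envs, num_moving, replace=False)
dof_vel[moving, 3:5] = rng.uniform(-0.3, 0.3, (num_moving, 2))

# Exactly num_moving rows are now nonzero; the rest still start at rest.
print((np.abs(dof_vel).sum(axis=1) > 0).sum())  # 2
```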

🟣 Layer 4: Configuration parameters

Add to cfg.py:

@dataclass
class VBotSection001EnvCfg(VBotStairsEnvCfg):
    # ... existing configuration ...
    
    # ===== New: zero-velocity-trap recovery parameters =====
    force_initial_motion: bool = True              # force initial velocity
    recovery_tilt_threshold: float = 80.0          # tilt angle (degrees) below which recovery is allowed
    
    @dataclass
    class RewardConfig:
        # ... adjusted weights (see Layer 1) ...
Validation metrics

Metric          Before fix                After fix            Target
Avg velocity    ~0.0 m/s                  >0.5 m/s             >1.0 m/s
Success rate    0% (never reaches goal)   30-50%               70%+
Max tilt        any tilt terminates       <80° allowed         dynamic recovery
Exploration     stationary                moving               diverse behaviors

Fix checklist

  • Reward weights fully reshaped (positive rewards > penalties)
  • Termination condition relaxed (60° → 80°)
  • Forced initial velocity implemented (breaks the zero-velocity trap)
    -...

This pull request was created from Copilot chat.



Copilot AI and others added 7 commits February 11, 2026 14:47
Copilot AI changed the title [WIP] Adjust reward function to mitigate zero-velocity local optimum Fix zero-velocity local optimum in VBot navigation via reward reshaping and recovery threshold Feb 11, 2026
Copilot AI requested a review from Logic-TARS February 11, 2026 14:59
@Logic-TARS Logic-TARS marked this pull request as ready for review February 11, 2026 15:00
@Logic-TARS Logic-TARS merged commit 17ec9e4 into main Feb 11, 2026
