
Fix zero-velocity local optimum in VBot navigation via reward reshaping and recovery threshold#5

Merged
Logic-TARS merged 8 commits into main from copilot/adjust-reward-function-weights
Feb 11, 2026
Conversation

Contributor

Copilot AI commented Feb 11, 2026

Problem

VBot agents converge to a stationary (zero-velocity) policy to minimize falling penalties, which prevents goal-reaching behavior. The reward structure creates a local optimum in which risk avoidance dominates exploration.

Solution

1. Reward Function Restructuring

Inverted the penalty-to-reward ratio to achieve 17.5:1 positive dominance:

# cfg.py - RewardConfig
scales = {
    # Strong positive incentives
    "forward_velocity": 2.0,           # NEW: was 0.5 (4x increase)
    "position_tracking": 1.5,
    
    # Minimal penalties (3-4x reduction)
    "orientation": -0.05,              # was -0.20
    "lin_vel_z": -0.10,                # was -0.30
    "ang_vel_xy": -0.05,               # was -0.15
    
    # Gait formation incentives
    "foot_air_time": 0.3,              # was 0.1
    "action_smoothness": 0.1,          # was -0.01 (flipped)
    "contact_stability": 0.2,          # was 0.1
    
    # Near-zero regularization
    "torques": -0.00001,               # unchanged at -1e-5 (near zero)
    "action_rate": -0.001,             # was -0.01
}

Added forward velocity reward computation:

# vbot_section001_np.py - _compute_reward()
forward_direction = position_error / (distance_to_target[:, np.newaxis] + 1e-6)
forward_velocity = np.sum(base_lin_vel[:, :2] * forward_direction, axis=1)
forward_velocity_reward = np.clip(forward_velocity, 0.0, 2.0)
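
The projection above can be exercised standalone with dummy arrays (a sketch: the array names and shapes are assumptions based on the snippet; the real values come from the environment state):

```python
import numpy as np

# Dummy stand-ins for environment state (1 env, XY plane).
position_error = np.array([[3.0, 4.0]])                       # target minus base position, XY
distance_to_target = np.linalg.norm(position_error, axis=1)   # [5.0]
base_lin_vel = np.array([[0.6, 0.8, 0.0]])                    # base linear velocity, XYZ

# Unit vector toward the target; the epsilon avoids division by zero at the goal.
forward_direction = position_error / (distance_to_target[:, np.newaxis] + 1e-6)
# Velocity component along that direction; clipping means only progress is rewarded.
forward_velocity = np.sum(base_lin_vel[:, :2] * forward_direction, axis=1)
forward_velocity_reward = np.clip(forward_velocity, 0.0, 2.0)
print(forward_velocity_reward)  # ≈ [1.0], since (0.6, 0.8) is exactly the goal direction
```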

Math: total positive potential = 2.0 + 1.5 = 3.5; total meaningful penalties ≈ 0.05 + 0.10 + 0.05 = 0.2 → a 17.5:1 ratio that encourages exploration.
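
The ratio arithmetic can be checked in isolation (the dict literal below just mirrors the dominant scale values listed above, not the real RewardConfig object; the near-zero regularizers are omitted as negligible):

```python
# Sanity check of the 17.5:1 claim using the dominant scales from the table above.
scales = {
    "forward_velocity": 2.0, "position_tracking": 1.5,
    "orientation": -0.05, "lin_vel_z": -0.10, "ang_vel_xy": -0.05,
}
positive = sum(v for v in scales.values() if v > 0)    # 3.5
penalties = -sum(v for v in scales.values() if v < 0)  # 0.2
print(positive, penalties, positive / penalties)       # 3.5 0.2 17.5
```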

2. Recovery-Enabled Termination

Rewrote _compute_terminated() to allow self-correction:

  • Tilt threshold raised from 60° to 80° (configurable via recovery_tilt_threshold)
  • ~33% more tilt tolerance before termination, leaving room to learn balance recovery
  • Maintains base contact detection for safety

3. Initial State Diversification

Modified reset() to force ~1/3 of environments to start with random velocity (±0.3 m/s XY), breaking the zero-velocity attractor basin.

4. Configuration Parameters

Added to VBotSection001EnvCfg:

  • force_initial_motion: bool = True
  • recovery_tilt_threshold: float = 80.0

Expected Impact

Metric            Before      Expected
Avg velocity      ~0.0 m/s    >0.5 m/s
Success rate      0%          30-50%
Tilt tolerance    60°         80°

Files Changed

  • cfg.py: Reward hierarchy + parameters (43 lines)
  • vbot_section001_np.py: Termination, reward computation, reset logic (155 lines)
  • Validation scripts + documentation (497 lines)
Original prompt

Problem description

The VBot navigation model is stuck in a zero-velocity local optimum, which severely blocks learning progress:

🔴 Current symptoms

  • The agent has learned to avoid fall penalties by standing still
  • The policy is extremely risk-averse (all joints near zero velocity)
  • It cannot discover a "move without falling" strategy
  • It rarely falls anymore, but it also never earns goal-related velocity rewards

🎯 Root causes

  1. Penalty weights too large: fall penalty >> movement reward
  2. Insufficient exploration: too few successful movement samples to guide learning
  3. Flawed reward structure: no intermediate rewards for gait formation
  4. Overly strict termination: any tilt ends the episode, amplifying the cost of failure

Solution (four progressive layers)

🟢 Layer 1: Reshape the reward function (CRITICAL)

1.1 Reward weight adjustments

Modify RewardConfig in cfg.py:

@dataclass
class RewardConfig:
    scales: dict[str, float] = field(
        default_factory=lambda: {
            # ===== Tier 1: strong positive rewards (drive movement) =====
            "forward_velocity": 2.0,              # NEW: forward-velocity reward (primary driver)
            "position_tracking": 1.5,             # goal tracking (kept)

            # ===== Tier 2: weak penalties (do not forbid exploration) =====
            # Key change: sharply reduce penalty weights so positive rewards dominate
            "orientation": -0.05,                 # ⬇️ from -0.20 (4x reduction)
            "lin_vel_z": -0.10,                   # ⬇️ from -0.30 (3x reduction)
            "ang_vel_xy": -0.05,                  # ⬇️ from -0.15 (3x reduction)

            # ===== Tier 3: fine-grained rewards (shape the gait) =====
            "foot_air_time": 0.3,                 # encourage regular leg swing
            "action_smoothness": 0.1,             # encourage smooth actions
            "contact_stability": 0.2,             # encourage foot contact

            # ===== Tier 4: near-zero penalties (almost no effect) =====
            "torques": -0.00001,                  # extremely low
            "action_rate": -0.001,                # extremely low
        }
    )

Key math:

Total reward = 2.0 * forward_velocity - 0.05 * orientation_penalty
Positive reward (2.0) > sum of all penalties (0.2) ✓
→ The agent has a strong incentive to try moving

1.2 New one-time terminal penalty (avoids continuous punishment)

# Add in _compute_reward():
terminal_penalty = np.where(
    state.terminated & ~timeout,
    -2.0,  # one-time penalty on falling
    0.0
)
reward += terminal_penalty
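
The one-shot penalty can be checked with dummy flags (a standalone sketch; the terminated/timeout arrays are stand-ins for the environment's episode-end flags):

```python
import numpy as np

# Stand-in episode-end flags for four environments.
terminated = np.array([True, True, False, False])
timeout = np.array([True, False, False, True])  # env 0 ended by timeout, not a fall

reward = np.zeros(4)
# Penalize only genuine falls: terminated AND not a timeout.
terminal_penalty = np.where(terminated & ~timeout, -2.0, 0.0)
reward += terminal_penalty
print(reward)  # [ 0. -2.  0.  0.] — only env 1 (a real fall) is penalized
```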

🟠 Layer 2: Improve termination conditions (IMPORTANT)

Modify the _compute_terminated() method in vbot_section001_np.py:

def _compute_terminated(self, state: NpEnvState) -> NpEnvState:
    """
    Improved termination: allow the robot to recover from mild tilts;
    only terminate in extreme situations.
    """
    data = state.data
    num_envs = self._num_envs
    terminated = np.zeros(num_envs, dtype=bool)
    
    # ===== 1. Timeout termination (kept; after 60 s) =====
    if self._cfg.max_episode_steps:
        timeout = state.info["steps"] >= self._cfg.max_episode_steps
        terminated = np.logical_or(terminated, timeout)
    
    # ===== 2. Improved fall detection (allows recovery) =====
    root_pos, root_quat, root_vel = self._extract_root_state(data)
    
    # Terminate only on "extreme" tilt (60° → 80°), giving the agent
    # a chance to recover from mild imbalance.
    # Note: this assumes _compute_projected_gravity returns a unit vector
    # whose z component is +1 when upright; negate it first if the helper
    # returns the downward gravity direction instead.
    gravity = self._compute_projected_gravity(root_quat)
    tilt_angle = np.arccos(np.clip(gravity[:, 2], -1.0, 1.0))  # angle from vertical
    extreme_tilt = tilt_angle > np.deg2rad(80)  # ⬆️ raised from 60° to 80°
    
    terminated = np.logical_or(terminated, extreme_tilt)
    
    # ===== 3. Base touching the ground (kept) =====
    try:
        cquerys = self._model.get_contact_query(data)
        termination_check = cquerys.is_colliding(self.termination_contact)
        base_contact = termination_check.reshape((num_envs, -1)).any(axis=1)
        terminated = np.logical_or(terminated, base_contact)
    except Exception as e:
        print(f"Warning: Could not check base contact: {e}")
    
    return state.replace(terminated=terminated)

What this change means

  • ❌ Before: mild tilt (>60°) → immediate termination → failure penalty
  • ✅ After: tilt ≤80° → recovery allowed → balance is learned
  • Effect: the agent gets a chance to self-correct and accumulate "did not fall" samples
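
The tilt test can be exercised in isolation (a sketch assuming the convention that the projected-gravity z component is +1 when the robot is upright; flip the sign if your helper returns the downward gravity direction):

```python
import numpy as np

# Projected-gravity z components for three robots: upright, 70° tilt, 85° tilt.
gravity_z = np.array([1.0, np.cos(np.deg2rad(70)), np.cos(np.deg2rad(85))])

tilt_angle = np.arccos(np.clip(gravity_z, -1.0, 1.0))  # angle from vertical
extreme_tilt = tilt_angle > np.deg2rad(80)             # new 80° threshold
print(extreme_tilt)  # [False False  True] — only the 85° robot terminates
```

Under the old 60° threshold the middle robot (70° tilt) would also have terminated; with 80° it survives and can attempt recovery.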

🔵 Layer 3: Force initial exploration

Add to the reset() method:

def reset(self, data: mtx.SceneData, done: np.ndarray = None):
    cfg: VBotSection001EnvCfg = self._cfg
    num_envs = data.shape[0]
    
    # ... existing initialization code ...
    
    dof_pos = np.tile(self._init_dof_pos, (num_envs, 1))
    dof_vel = np.tile(self._init_dof_vel, (num_envs, 1))
    
    # ===== New: force initial motion (break the zero-velocity trap) =====
    if hasattr(cfg, 'force_initial_motion') and cfg.force_initial_motion:
        # Force 1/3 of the environments to start moving (instead of at rest)
        num_moving = max(1, num_envs // 3)
        moving_indices = np.random.choice(num_envs, num_moving, replace=False)
        
        # Random initial forward velocity (a gentle push)
        initial_push = np.random.uniform(
            -0.3, 0.3,
            (num_moving, 2)  # XY velocity, ±0.3 m/s
        )
        dof_vel[moving_indices, 3:5] = initial_push
    
    # Write back to the physics engine
    data.set_dof_vel(dof_vel)
    data.set_dof_pos(dof_pos, self._model)
    self._model.forward_kinematic(data)
    
    # ... remaining code ...

Why this works

  • Some environments are forced to start from nonzero velocity
  • The agent must learn to "keep moving" rather than "stop"
  • The initial condition feeding the zero-velocity trap is broken
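
The selection logic can be sketched standalone (num_envs and the DOF count are arbitrary here, and the generator is seeded only to make the sketch reproducible):

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only for reproducibility of this sketch
num_envs = 8
dof_vel = np.zeros((num_envs, 12))  # 12 DOFs assumed purely for illustration

# Pick ~1/3 of the environments to receive a random XY push.
num_moving = max(1, num_envs // 3)
moving = rng.choice(num_envs, num_moving, replace=False)
dof_vel[moving, 3:5] = rng.uniform(-0.3, 0.3, (num_moving, 2))

# Exactly num_moving rows are now nonzero; the rest still start at rest.
print((np.abs(dof_vel).sum(axis=1) > 0).sum())  # 2
```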

🟣 Layer 4: Configuration parameters

Add to cfg.py:

@dataclass
class VBotSection001EnvCfg(VBotStairsEnvCfg):
    # ... existing configuration ...
    
    # ===== New: zero-velocity-trap recovery parameters =====
    force_initial_motion: bool = True              # force initial velocity
    recovery_tilt_threshold: float = 80.0          # tilt angle (degrees) below which recovery is allowed
    
    @dataclass
    class RewardConfig:
        # ... adjusted weights (see Layer 1) ...
Validation metrics

Metric          Before fix                After fix            Target
Avg velocity    ~0.0 m/s                  >0.5 m/s             >1.0 m/s
Success rate    0% (never reaches goal)   30-50%               70%+
Max tilt        any tilt terminates       <80° allowed         dynamic recovery
Exploration     stationary                moving               diverse behaviors

Fix checklist

  • Reward weights fully reshaped (positive rewards > penalties)
  • Termination condition relaxed (60° → 80°)
  • Forced initial velocity implemented (breaks the zero-velocity trap)
    -...

This pull request was created from Copilot chat.



Copilot AI and others added 7 commits February 11, 2026 14:47
Copilot AI changed the title [WIP] Adjust reward function to mitigate zero-velocity local optimum Fix zero-velocity local optimum in VBot navigation via reward reshaping and recovery threshold Feb 11, 2026
Copilot AI requested a review from Logic-TARS February 11, 2026 14:59
@Logic-TARS Logic-TARS marked this pull request as ready for review February 11, 2026 15:00
@Logic-TARS Logic-TARS merged commit 17ec9e4 into main Feb 11, 2026
