Why are the outputs different between my_env.py and evaluate.py? #355

wangchen11maker · 2024-10-01T07:31:00Z

I defined my own environment and, after training for a small number of steps, I want to evaluate the saved model. In my_env.py I set 20 steps as the end condition of an episode. During evaluation, truncated in my_env.py does become tensor(false) after step 20, which matches the settings in my env, yet the render function in evaluator.py runs 234 times before it stops. Why is that?

I also printed the five values from "obs, rew, cost, terminated, truncated, _ = self._env.step(act)" in the render function of evaluator.py, and they are not the same as the values returned by the step function in my_env.py. Is this because the two are different objects? Any help would be appreciated, thank you!

wangchen11maker · 2024-10-01T07:36:02Z

evaluator.py:
def render(  # pylint: disable=too-many-locals,too-many-arguments,too-many-branches,too-many-statements
    self,
    num_episodes: int = 1,
    save_replay_path: str | None = None,
    max_render_steps: int = 2000,
    cost_criteria: float = 1.0,
) -> None:  # pragma: no cover
    """Render the environment for one episode.

    Args:
        num_episodes (int, optional): The number of episodes to render. Defaults to 1.
        save_replay_path (str or None, optional): The path to save the replay video. Defaults to
            None.
        max_render_steps (int, optional): The maximum number of steps to render. Defaults to 2000.
        cost_criteria (float, optional): The discount factor for the cost. Defaults to 1.0.
    """
    assert (
        self._env is not None
    ), 'The environment must be provided or created before rendering.'
    assert (
        self._actor is not None or self._planner is not None
    ), 'The policy or planner must be provided or created before rendering.'
    if save_replay_path is None:
        save_replay_path = os.path.join(self._save_dir, 'video', self._model_name.split('.')[0])
    result_path = os.path.join(save_replay_path, 'result.txt')
    print(self._dividing_line)
    print(f'Saving the replay video to {save_replay_path},\n and the result to {result_path}.')
    print(self._dividing_line)

    horizon = 1000
    frames = []
    obs, _ = self._env.reset()
    if self._render_mode == 'human':
        self._env.render()
    elif self._render_mode == 'rgb_array':
        frames.append(self._env.render())

    episode_rewards: list[float] = []
    episode_costs: list[float] = []
    episode_lengths: list[float] = []

    # print(num_episodes)  # 1
    # print(max_render_steps)  # 2000
    for episode_idx in range(1):  # num_episodes
        self._safety_obs = torch.ones(1)
        step = 0
        done = False
        ep_ret, ep_cost, length = 0.0, 0.0, 0.0
        while (
            not done and step <= max_render_steps
        ):  # a big number to make sure the episode will end
            if 'Saute' in self._cfgs['algo'] or 'Simmer' in self._cfgs['algo']:
                obs = torch.cat([obs, self._safety_obs], dim=-1)
            with torch.no_grad():
                if self._actor is not None:
                    act = self._actor.predict(
                        obs.reshape(
                            -1,
                            obs.shape[-1],  # to make sure the shape is (1, obs_dim)
                        ),
                        deterministic=True,
                    ).reshape(
                        -1,  # to make sure the shape is (act_dim,)
                    )
                elif self._planner is not None:
                    act = self._planner.output_action(
                        obs.unsqueeze(0).to('cpu'),
                    )[
                        0
                    ].squeeze(0)
                else:
                    raise ValueError(
                        'The policy must be provided or created before evaluating the agent.',
                    )

            # print(step)
            obs, rew, cost, terminated, truncated, _ = self._env.step(act)

            print('render' + str(step))  

            print(obs)   # here1
            print(rew)
            print(cost)
            print(terminated)
            print(truncated)

            if 'Saute' in self._cfgs['algo'] or 'Simmer' in self._cfgs['algo']:
                self._safety_obs -= cost.unsqueeze(-1) / self._safety_budget
                self._safety_obs /= self._cfgs.algo_cfgs.saute_gamma
            step += 1
            done = bool(terminated or truncated)
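
One way to narrow this down is to check whether the object that `render` calls `step` on is the custom environment itself or a wrapper around it, since a wrapper could change the values seen by the caller. The sketch below is not OmniSafe code: `ToyEnv`, `ScaleObs`, `wrapper_chain`, and `count_render_steps` are hypothetical names, and it assumes that each wrapper keeps its inner environment in an attribute called `_env`. It only mirrors the loop condition of `render` shown above, using plain Python values instead of tensors.

```python
from __future__ import annotations


class ToyEnv:
    """Hypothetical stand-in for a custom env that truncates after `max_steps` steps."""

    def __init__(self, max_steps: int = 20) -> None:
        self._max_steps = max_steps
        self._count = 0

    def reset(self) -> float:
        self._count = 0  # the step counter starts over with every episode
        return 0.0

    def step(self, act: float):
        self._count += 1
        obs = float(self._count)
        truncated = self._count >= self._max_steps
        return obs, 1.0, 0.0, False, truncated, {}


class ScaleObs:
    """Hypothetical wrapper that rescales observations; the inner env lives in `_env`."""

    def __init__(self, env, scale: float) -> None:
        self._env = env
        self._scale = scale

    def reset(self) -> float:
        return self._env.reset() * self._scale

    def step(self, act: float):
        obs, rew, cost, terminated, truncated, info = self._env.step(act)
        return obs * self._scale, rew, cost, terminated, truncated, info


def wrapper_chain(env) -> list[str]:
    """Return class names from the outermost wrapper down to the base env."""
    names = []
    while env is not None:
        names.append(type(env).__name__)
        env = getattr(env, '_env', None)  # assumption: wrappers store the inner env here
    return names


def count_render_steps(env, max_render_steps: int = 2000) -> int:
    """Mirror the loop condition of `render` and return how many steps were taken."""
    env.reset()
    step, done = 0, False
    while not done and step <= max_render_steps:
        _, _, _, terminated, truncated, _ = env.step(0.0)
        step += 1
        done = bool(terminated or truncated)
    return step


env = ScaleObs(ToyEnv(max_steps=20), scale=0.5)
print(wrapper_chain(env))       # ['ScaleObs', 'ToyEnv']
print(count_render_steps(env))  # 20
```

In this toy setup the observations printed by the caller differ from the ones produced inside `ToyEnv.step`, while the loop still stops after 20 steps because the flags pass through unchanged. Printing a similar chain for `self._env` inside `render` would show whether anything sits between the evaluator and my_env.py.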

my_env.py:
def step(
    self,
    action: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, dict]:
    """Run one timestep of the environment's dynamics using the agent actions.

    .. note::
        You need to implement dynamic features related to environment interaction here. That is:

        1. Update the environment state based on the action;
        2. Calculate reward and cost based on the environment state;
        3. Determine whether to terminate based on the environment state;
        4. Record the information you need.

    Args:
        action (torch.Tensor): The action from the agent or random.

    Returns:
        observation: The agent's observation of the current environment.
        reward: The amount of reward returned after previous action.
        cost: The amount of cost returned after previous action.
        terminated: Whether the episode has ended.
        truncated: Whether the episode has been truncated due to a time limit.
        info: Some information logged by the environment.
    """
    self._count += 1
    # print(action)

    """ State transition """
    # State: absolute coordinates of the UAV and of the goal
    # Action: speed and direction of the UAV
    uav_obs_curr = self.lastState
    uav_goal_obs = np.array((-1.0, 1.0, 0.0))
    uav_obs_new = uav_obs_curr + move

    state_array = np.concatenate((uav_obs_new, uav_goal_obs))
    obs = torch.as_tensor(state_array)
    obs = obs.float()

    """ Reward """
    reward = 1   # total reward
    reward = torch.as_tensor(reward)

    """ Cost """
    cost = -1 if ...
    cost = torch.as_tensor(cost)
    # print(cost.dtype)


    """ Whether the episode has ended """
    terminated = torch.tensor(arrive_flag)
    terminated = torch.as_tensor(terminated)
    # print(terminated.dtype)

    # truncated = torch.as_tensor(self._count > self._max_episode_steps)
    if self._count > self._max_episode_steps:
        over_step_flag = 1
    if self._count <= self._max_episode_steps:
        over_step_flag = 0
    truncated = torch.tensor(over_step_flag)
    truncated = torch.as_tensor(truncated)

    print('env' + str(self._count))   # here2
    print(obs)
    print(reward)
    print(cost)
    print(terminated)
    print(truncated)

    """ Update the locally stored state """
    self.lastState = uav_obs_new
    return obs, reward, cost, terminated, truncated, {'final_observation': obs}
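
Two details of the truncation logic in this `step` can be checked in isolation. The sketch below is hypothetical and not the author's code: `StepCounter` and `steps_until_truncated` are illustrative names, `_max_episode_steps = 20` is assumed from the description in the post, and plain booleans replace the tensors. It shows that the strict comparison `self._count > self._max_episode_steps` first fires on step 21 rather than step 20, and that the counter only behaves per episode if `reset()` sets it back to zero (the `reset` method of my_env.py is not shown in the post, so this is an assumption to verify).

```python
class StepCounter:
    """Hypothetical helper that isolates the truncation logic of `step` above."""

    def __init__(self, max_episode_steps: int = 20, strict: bool = True) -> None:
        self._max_episode_steps = max_episode_steps  # assumed value, taken from the post
        self._strict = strict
        self._count = 0

    def reset(self) -> None:
        # Without this, the count carries over into the next episode.
        self._count = 0

    def step(self) -> bool:
        """Advance one step and return the truncated flag."""
        self._count += 1
        if self._strict:
            return self._count > self._max_episode_steps   # comparison used in the post
        return self._count >= self._max_episode_steps      # fires exactly on step 20


def steps_until_truncated(counter: StepCounter, limit: int = 1000) -> int:
    """Step the counter until it reports truncation and return the step number."""
    for n in range(1, limit + 1):
        if counter.step():
            return n
    return limit


print(steps_until_truncated(StepCounter(20, strict=True)))   # 21
print(steps_until_truncated(StepCounter(20, strict=False)))  # 20

stale = StepCounter(20, strict=True)
steps_until_truncated(stale)          # first episode
print(steps_until_truncated(stale))   # 1, because the count was never reset
```

Neither effect alone accounts for 234 steps, but both change the step at which `truncated` becomes true, so they are worth ruling out before comparing the values printed in my_env.py and evaluator.py.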
