The compute_loss function is wrong for the Simplest Policy Gradient

I have been reading the 3 parts from "Introduction to RL" section and I have observed in part 3 that the compute_loss function for the Simplest Policy Gradient returns the mean of the product between the log probabilities of the actions taken by the agent and the weights of those actions, in other words, the finite-horizon undiscounted returns of the episodes in which they were taken.

![image](https://github.com/openai/spinningup/assets/42323683/ec9430c8-d90e-4cda-aa23-0c0b6f0077ac)

In the estimation of the Basic Gradient Policy above, the sums of products is divided by the number of trajectories, but in the implementation, when you return the mean, the sums of products is divided by the number of all the actions taken across all the trajectories from one epoch. Maybe I am understanding this wrong, but I wanted to get a clear picture on the implementation.

![image](https://github.com/openai/spinningup/assets/42323683/9051327a-4b95-4f5c-b847-b26750aa9975)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The compute_loss function is wrong for the Simplest Policy Gradient #414

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

The compute_loss function is wrong for the Simplest Policy Gradient #414

Description

Activity

hirodeng commented on Jun 28, 2024

earnesdm commented on Aug 25, 2024

burichh commented on Jan 30, 2025

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions