
how did you figure out continuous? #3

Closed
nyck33 opened this issue May 23, 2019 · 1 comment


nyck33 commented May 23, 2019

I can see that your Actor network has a tanh activation on the output layer, but then I am totally lost as to what you do here:

def act(self, state, memory):
    action_mean = self.actor(state)
    dist = MultivariateNormal(action_mean, torch.diag(self.action_var).to(device))
    action = dist.sample()
    action_logprob = dist.log_prob(action)

I'm especially confused by action_mean = self.actor(state). Does this mean you have one output node and assume the output is the mean of a Gaussian distribution over the action space?

Then similar code appears here:

def evaluate(self, state, action):
    action_mean = self.actor(state)
    dist = MultivariateNormal(torch.squeeze(action_mean), torch.diag(self.action_var))

    action_logprobs = dist.log_prob(torch.squeeze(action))
    dist_entropy = dist.entropy()
    state_value = self.critic(state)

    return action_logprobs, torch.squeeze(state_value), dist_entropy

Also, is self.policy like a dummy Actor-Critic network that you use just to get updated parameters to load into self.policy_old? I know this isn't Stack Overflow, but if you could look at my implementation and let me know how I can adapt it for a continuous action space, that'd be great.

my PPO discrete action space

@nikhilbarhate99
Owner

Hey, the action distribution is assumed to be a multivariate normal (Gaussian) distribution with a diagonal covariance matrix. So, the last layer outputs the means of all variables in the action space (i.e., a mean vector), and the covariance matrix is just the diagonal matrix of the square of a fixed standard deviation (hyperparameter: action_std). From the mean vector and the covariance matrix we can construct a multivariate normal distribution using PyTorch's MultivariateNormal class.
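A minimal sketch of that construction (illustrative values only; action_dim = 4 and action_std = 0.5 are placeholders, not the repo's actual settings):

import torch
from torch.distributions import MultivariateNormal

action_dim = 4      # illustrative; set to your environment's action size
action_std = 0.5    # illustrative fixed standard deviation (hyperparameter)

# Suppose the actor's tanh output layer produced this mean vector for one state.
action_mean = torch.zeros(action_dim)

# Diagonal covariance: variance = action_std ** 2 on the diagonal.
action_var = torch.full((action_dim,), action_std ** 2)
cov_mat = torch.diag(action_var)

dist = MultivariateNormal(action_mean, cov_mat)
action = dist.sample()                    # one sampled action vector
action_logprob = dist.log_prob(action)    # log-density of that action under the policy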
Regarding self.policy: since we update self.policy for k_epochs (i.e., k times) in one PPO update, we keep self.policy_old as a copy of the old network weights to compute the probability ratios.
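A minimal sketch of how those ratios are typically computed in log-space (the tensors and values below are illustrative, not the repo's actual update loop):

import torch

# log-probs stored when the actions were sampled with self.policy_old
old_logprobs = torch.tensor([-1.20, -0.85, -2.10])
# log-probs recomputed for the same (state, action) pairs by the current self.policy
new_logprobs = torch.tensor([-1.05, -0.90, -1.95])

# Probability ratios pi_theta(a|s) / pi_theta_old(a|s) used in the clipped surrogate objective.
ratios = torch.exp(new_logprobs - old_logprobs.detach())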
I think you should refer to the original PPO paper for more details.
