wrong objective/entropy in RLOOTrainer #2281

@serendipity800


System Info

Latest trl version

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

mean_entropy = (-logprobs).sum(1).mean()

This line is wrong: by this point, logprobs has already been masked with

logprobs = torch.masked_fill(logprobs, padding_mask, INVALID_LOGPROB)
ref_logprobs = torch.masked_fill(ref_logprobs, padding_mask, INVALID_LOGPROB)

where INVALID_LOGPROB = 1.0. Summing -logprobs therefore adds -1.0 for every padded position, which pushes the logged objective/entropy below zero on long padded sequences.
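
To make the failure concrete, here is a tiny self-contained reproduction (the logprob values are made up; only the fill value and the reduction mirror the trainer code):

import torch

INVALID_LOGPROB = 1.0

# One sequence: two real tokens followed by three padded positions
# (padding_mask is True at padded positions, as in the trainer).
logprobs = torch.tensor([[-0.5, -0.3, 0.0, 0.0, 0.0]])
padding_mask = torch.tensor([[False, False, True, True, True]])
logprobs = torch.masked_fill(logprobs, padding_mask, INVALID_LOGPROB)

mean_entropy = (-logprobs).sum(1).mean()
print(mean_entropy)  # tensor(-2.2000): each padded position contributed -1.0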

Expected behavior

objective/entropy should be computed only over non-padded tokens; since each real token's -logprob is non-negative, the logged entropy should never go negative because of padding.
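
A minimal sketch of one possible fix (not an official patch), reusing the tensors from the reproduction above: zero out the padded positions before summing, so the fill value never enters the entropy:

# Exclude padded positions from the sum instead of summing the fill value.
mean_entropy = torch.masked_fill(-logprobs, padding_mask, 0.0).sum(1).mean()
print(mean_entropy)  # tensor(0.8000) for the toy example above

An alternative is to report a per-token entropy, (-logprobs)[~padding_mask].mean(), which also sidesteps the fill value.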
