wrong objective/entropy in RLOOTrainer #2281

@serendipity800


System Info

Latest trl version

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

mean_entropy = (-logprobs).sum(1).mean()

This line is wrong: by this point, logprobs has already been masked with

logprobs = torch.masked_fill(logprobs, padding_mask, INVALID_LOGPROB)
ref_logprobs = torch.masked_fill(ref_logprobs, padding_mask, INVALID_LOGPROB)

where INVALID_LOGPROB = 1.0. Summing -logprobs therefore adds -1.0 for every padded position, which pushes the logged objective/entropy below zero on long padded sequences.
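
To make the failure concrete, here is a tiny self-contained reproduction (the logprob values are made up; only the fill value and the reduction mirror the trainer code):

import torch

INVALID_LOGPROB = 1.0

# One sequence: two real tokens followed by three padded positions
# (padding_mask is True at padded positions, as in the trainer).
logprobs = torch.tensor([[-0.5, -0.3, 0.0, 0.0, 0.0]])
padding_mask = torch.tensor([[False, False, True, True, True]])
logprobs = torch.masked_fill(logprobs, padding_mask, INVALID_LOGPROB)

mean_entropy = (-logprobs).sum(1).mean()
print(mean_entropy)  # tensor(-2.2000): each padded position contributed -1.0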

Expected behavior

objective/entropy should be computed only over non-padded tokens; since each real token's -logprob is non-negative, the logged entropy should never go negative because of padding.
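
A minimal sketch of one possible fix (not an official patch), reusing the tensors from the reproduction above: zero out the padded positions before summing, so the fill value never enters the entropy:

# Exclude padded positions from the sum instead of summing the fill value.
mean_entropy = torch.masked_fill(-logprobs, padding_mask, 0.0).sum(1).mean()
print(mean_entropy)  # tensor(0.8000) for the toy example above

An alternative is to report a per-token entropy, (-logprobs)[~padding_mask].mean(), which also sidesteps the fill value.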
