Weight "executed code" more prominently #233

@zimmski

In the v0.5.0 eval run, GPT-4 ranks above Gemini 1.5 Flash. Gemini produces more executable code, but GPT-4 has a higher coverage score, which is why it ends up ahead. However, it makes sense to order by executable code first and by coverage second. We need to balance:

  • Executable code should be weighted much higher
  • Coverage is still very important
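The two options discussed above (strict ordering by executable code versus a heavier weight for it) could be sketched roughly as follows. This is only an illustration, not the evaluation's actual scoring code: `ModelResult`, its field names, the weights, and the numbers are all hypothetical and are not taken from the v0.5.0 run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    """Hypothetical per-model totals; the field names are assumptions."""

    name: str
    executable: int  # count of responses whose code executed
    coverage: int  # count of coverage points reached


def rank_lexicographic(results):
    """Order by executable code first; coverage only breaks ties."""
    return sorted(results, key=lambda r: (r.executable, r.coverage), reverse=True)


def weighted_score(result, executable_weight=10, coverage_weight=1):
    """Weighted sum of both metrics; the default weights are illustrative, not tuned."""
    return executable_weight * result.executable + coverage_weight * result.coverage


def rank_weighted(results, executable_weight=10, coverage_weight=1):
    """Order by the weighted sum, highest score first."""
    return sorted(
        results,
        key=lambda r: weighted_score(r, executable_weight, coverage_weight),
        reverse=True,
    )


# Made-up numbers: model-b has more executable code, model-a more coverage.
results = [
    ModelResult("model-a", executable=90, coverage=300),
    ModelResult("model-b", executable=100, coverage=250),
]

print([r.name for r in rank_weighted(results, 1, 1)])  # equal weights
print([r.name for r in rank_weighted(results)])  # executable weighted 10x
print([r.name for r in rank_lexicographic(results)])  # strict ordering
```

With equal weights the higher-coverage model wins, which mirrors the problem described here. Strict ordering always prefers more executable code but ignores coverage unless there is a tie, whereas a weighted sum keeps coverage relevant at the cost of having to choose the weights.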

Labels

  • enhancement: New feature or request
  • postponed: This issue/PR is postponed until there is a very good reason (e.g. $$$) to implement it.