Weight "executed code" more prominently

In the v0.5.0 eval run, GPT-4 ranks above Gemini 1.5 Flash. Gemini produces more executable code, but GPT-4 has a higher coverage score, which is why it ends up ranked higher. However, it makes more sense to order first by executable code and only then by coverage. We need to balance:
- Executable code should be weighted much higher 
- Coverage is still very important
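
To make the trade-off concrete, here is a minimal sketch of two possible ranking strategies: a strict "executable first, then coverage" ordering, and a weighted sum in which executable code counts much more than coverage. Everything in it is an assumption for illustration: the `ModelResult` type, the weights, the model names, and the numbers are made up and are not taken from the v0.5.0 results or from the current scoring code.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only for illustration (not from the eval).
EXECUTABLE_WEIGHT = 10
COVERAGE_WEIGHT = 1


@dataclass
class ModelResult:
    name: str
    executable: int  # number of responses whose code executed
    coverage: int    # coverage points collected


def plain_score(result: ModelResult) -> int:
    """Unweighted sum: coverage can outweigh executable code."""
    return result.executable + result.coverage


def weighted_score(result: ModelResult) -> int:
    """Weighted sum: executable code dominates, coverage still counts."""
    return EXECUTABLE_WEIGHT * result.executable + COVERAGE_WEIGHT * result.coverage


def rank_plain(results: list[ModelResult]) -> list[ModelResult]:
    return sorted(results, key=plain_score, reverse=True)


def rank_weighted(results: list[ModelResult]) -> list[ModelResult]:
    return sorted(results, key=weighted_score, reverse=True)


def rank_lexicographic(results: list[ModelResult]) -> list[ModelResult]:
    """Order by executable code first; coverage only breaks ties."""
    return sorted(results, key=lambda r: (r.executable, r.coverage), reverse=True)


# Made-up example: model-a has more coverage, model-b has more executable code.
results = [
    ModelResult("model-a", executable=8, coverage=60),
    ModelResult("model-b", executable=10, coverage=45),
]

for label, rank in [
    ("plain", rank_plain),
    ("weighted", rank_weighted),
    ("lexicographic", rank_lexicographic),
]:
    print(label, [r.name for r in rank(results)])
```

With these made-up numbers the plain sum puts `model-a` first, while both the weighted and the lexicographic ranking put `model-b` first. The lexicographic variant ignores coverage unless executable counts tie, so a weighted sum may be the better fit if coverage should stay "very important"; the actual weights would need to be tuned against real results.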