It seems that with #94 we introduced some non-determinism into the tests for CSV output.

https://github.com/symflower/eval-dev-quality/actions/runs/8987716044/job/24686885243#logs

### Tasks:

- [x] Sort the models before writing the CSV output
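
A minimal sketch of the sorting task, assuming the per-model results live in a Go map (iteration order over a Go map is not specified, which would explain rows changing order between runs). The function name `scoresCSV`, the `model,score` columns and the integer score type are hypothetical and only serve the illustration; they are not taken from the repository.

```go
package main

import (
	"encoding/csv"
	"sort"
	"strconv"
	"strings"
)

// scoresCSV is a hypothetical helper, not the project's actual function.
// It renders one CSV row per model, ordered by model name, so that repeated
// runs produce identical output regardless of map iteration order.
func scoresCSV(scores map[string]int) (string, error) {
	// Collect the model names first, because ranging over a map
	// yields the keys in an unspecified order.
	models := make([]string, 0, len(scores))
	for model := range scores {
		models = append(models, model)
	}
	sort.Strings(models)

	var out strings.Builder
	w := csv.NewWriter(&out)

	// Header row (assumed column names).
	if err := w.Write([]string{"model", "score"}); err != nil {
		return "", err
	}
	// One row per model, in sorted order.
	for _, model := range models {
		if err := w.Write([]string{model, strconv.Itoa(scores[model])}); err != nil {
			return "", err
		}
	}

	// Flush buffered rows and surface any write error.
	w.Flush()
	if err := w.Error(); err != nil {
		return "", err
	}

	return out.String(), nil
}
```

Calling `scoresCSV` with the same map repeatedly should then yield the same string every time, which lets a test compare against a fixed expected CSV.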