Roadmap for v0.5.0

The v0.5.0 release is mainly meant to introduce more variety. There are two main goals:
1. Introduce more logical cases, to make sure that "better models" have a bigger difference in score.
2. Introduce more providers so we can test models that have been requested and react faster to new releases.

Tasks:
- [x] Management
  - [x] add https://github.com/symflower/eval-dev-quality/milestone/2
  - [x] TODO https://github.com/symflower/eval-dev-quality/pulls?q=is%3Aopen+is%3Apr+milestone%3Av0.5.0
- [x] Documentation
  - [x] Less fuzz and fluff (see Thomas' feedback)
  - [x] Bring in the information from the current blog post, especially the blog post image, to showcase the evaluation
  - [x] Readme extension: `The nice thing about generating tests is that it is easy to automatically check if the result is correct. Needs to compile and provide 100% coverage. But one can only write such tests if they understand the source, so implicitly we are evaluating the language understanding of the LLM.`
- [x] Evaluation
  - [x] Measure processing time of queries #106 #105
  - [x] Automate multiple runs for more deterministic results #109 #108
  - [x] Empty responses should be marked as error responses, to indicate that they are on the same level #97 #92 
  - [x] When an LLM is evaluated on the "plain" repository and Go fails but Java works, it is currently blocked from ALL repositories. In that case it should only be blocked for Go and not for Java. #129
  - [x] Do clean up of generated files #148 #147 #114 #146 #145 
  - [x] Java
    - [x] Log Maven commands because they can be faulty (remember that the "surefire" plugin needs a fixed version because GitHub's is tooooo old). With that it is easier to debug `symflower v36847`
  - [x] Even powerful models such as GPT4 and Llama3 might return EOF or an error (https://github.com/symflower/eval-dev-quality/tree/105%2B108/evaluation-2024-05-14-09%3A18%3A41), so add retry logic: https://github.com/symflower/eval-dev-quality/issues/123 #125
- [x] Support execution of more models
  - [x] Ollama support #91 #95 #96 #117 #118 #115 #27
  - [x] Allow arbitrary URLs for API provider #111 #112 
- [x] Reporting and Metrics
  - [x] Additional CSVs to sum up overall results and per-language results #94 #83
  - [x] fix, Y axis ticks should be readable #73
  - [x] fix, Deterministic order of rows in CSV export #99 #98
- [x] Multi OS support of eval
  - [x] Support macOS #102
  - [x] Support Windows #103 #101 #104
- [x] Tools
  - [x] Introduce unique IDs for addressing tools OS-independently #122
  - [x] Extend `symflower test` with a deeper execution coverage export
    - [x] Go
      - [x] Extract to file
      - [x] Cover lines of tests that have exceptions (fixed with `symflower v36800`) #144 
    - [x] Java
      - [x] Extract to file
      - [x] Cover lines of tests that have exceptions (fixed with `symflower v36800`) #144 
- [x] Tasks and cases
  - [x] Introduce more cases with logic in "light" repository
    - [x] Go #124 
    - [x] Java #134 #124 
- [x] Release
  - [x] Do a full evaluation with the new version
  - [x] Tag version
  - [x] Blog post
  - [x] Adapt README
  - [x] Announce and eat cake
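
To illustrate two of the evaluation tasks above (empty responses count as errors, and failed queries are retried), here is a minimal Go sketch. It is not the project's actual implementation: `queryFunc`, `queryWithRetry`, and `ErrEmptyResponse` are hypothetical names chosen for this example, and the real provider interface may look different.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrEmptyResponse marks an empty model response as an error (hypothetical name).
var ErrEmptyResponse = errors.New("empty response from model")

// queryFunc is a hypothetical stand-in for a single call to a model provider.
type queryFunc func(prompt string) (string, error)

// queryWithRetry calls query up to "attempts" times and returns the first
// non-empty response. Empty responses are treated like any other error.
func queryWithRetry(query queryFunc, prompt string, attempts int) (string, error) {
	if attempts < 1 {
		attempts = 1
	}

	var errs []error
	for i := 1; i <= attempts; i++ {
		response, err := query(prompt)
		if err == nil && strings.TrimSpace(response) == "" {
			err = ErrEmptyResponse
		}
		if err == nil {
			return response, nil
		}
		errs = append(errs, fmt.Errorf("attempt %d: %w", i, err))
	}

	// All attempts failed, report every collected error.
	return "", errors.Join(errs...)
}

func main() {
	calls := 0
	// Simulated provider that returns empty responses for the first two calls.
	flaky := func(prompt string) (string, error) {
		calls++
		if calls < 3 {
			return "", nil
		}
		return "func TestPlain(t *testing.T) {}", nil
	}

	response, err := queryWithRetry(flaky, "write a test", 3)
	fmt.Println(response, err, calls)
}
```

Collecting the per-attempt errors with `errors.Join` (Go 1.20+) keeps every failure visible in the logs, which helps when debugging flaky providers.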

