Closed
Labels
roadmap: Collection of issues for a release
Milestone
Description
The v0.5.0 release is mainly meant to introduce more variety. There are three main goals:
- Introduce more cases with logic, so that "better models" show a bigger difference in score.
- Introduce more providers, so we can test models that have been requested and react faster to new releases.
Tasks:
- Management
- Documentation
- Less fuzz and fluff (see Thomas' feedback)
- Bring in the information of the current blog post, especially the blog post image, to showcase the evaluation
- Readme extension:
The nice thing about generating tests is that it is easy to check automatically whether the result is correct: the generated tests need to compile and provide 100% coverage. But one can only write such tests if one understands the source, so implicitly we are evaluating the language understanding of the LLM.
- Evaluation
- Measure processing time of queries (Measure processing time of model responses #106, Measure Model response time #105)
- Automate multiple runs for more deterministic results (Multiple runs in a single evaluation #109, Multiple Runs #108)
- Empty responses should be marked as error responses, to indicate that they are on the same level (Empty model responses should be handled as errors #97, Empty responses should not be tested but should fail #92)
- When an LLM is checked for solving "plain" successfully and Go fails but Java works, it is blocked for ALL repositories. In that case it should only be blocked for Go and not for Java (Do not cancel successive runs if previous runs had problems #129)
- Clean up generated files (Reset repository per task #148, Repository not reset for multiple tasks #147, Use Git to avoid copying the repository on each model run #114, Use empty Git config in temporary repositories #146, The git repository change requires the GPG password #145)
- Java
- Log Maven commands because they can be faulty (remember that the "surefire" plugin needs a fixed version because GitHub's is too old). This makes debugging easier
symflower v36847
- Even powerful models such as GPT4 and Llama3 might return EOF or an error (https://github.com/symflower/eval-dev-quality/tree/105%2B108/evaluation-2024-05-14-09%3A18%3A41), so add retry logic (Give models a retry on error #123, Allow to retry a model when it errors #125)
- Support execution of more models
- Ollama support (Integrate Ollama #91, Ollama tool installation #95, Support Ollama provider #96, Fixed Ollama version #117, Ollama version check and update if version is outdated #118, Prepare for Ollama provider #115, Ollama provider #27)
- Allow arbitrary URLs for API provider (Generic OpenAI API provider #111, Generic OpenAI API provider #112)
- Reporting and Metrics
- Additional CSVs to sum up overall results and per-language results (Additional CSVs to sum up metrics for all models overall and per language #94, Add additional CSV files that sum up: overall, per-language #83)
- Fix: Y-axis ticks should be readable (Fix svg Y axis ticks #73)
- Fix: deterministic order of rows in CSV export (Sort map by model before creating the CSV output #99, Non deterministic test output leads to flaky CI Jobs #98)
- Multi OS support of eval
- Tools
- Introduce unique ID for addressing tools OS-independently (Introduce an "ID" method to the tool interface #122)
- Extend symflower test with a deeper execution coverage export
- Go
- Extract to file
- Cover lines of tests that have exceptions (fixed with symflower v36800; Require at least symflower v36800 #144)
- Java
- Extract to file
- Cover lines of tests that have exceptions (fixed with symflower v36800; Require at least symflower v36800 #144)
- Tasks and cases
- Introduce more cases with logic in "light" repository
- Release
- Do a full evaluation with the new version
- Tag version
- Blog post
- Adapt README
- Announce and eat cake