
Commit a0f48bc

Merge pull request #21 from symflower/small-readme
First version of a simple README for the repository on motivation, installation, usage and contributing
2 parents a558ed3 + 6af8336 commit a0f48bc

File tree

2 files changed: +118 -1 lines changed


Makefile

Lines changed: 4 additions & 0 deletions

```diff
@@ -21,6 +21,10 @@ clean: # Clean up artifacts of the development environment to allow for untainte
 	go clean -i -race $(PACKAGE)
 .PHONY: clean
 
+help: # Show this help message.
+	@grep -E '^[a-zA-Z-][a-zA-Z0-9.-]*?:.*?# (.+)' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?# "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}'
+.PHONY: help
+
 install: # [<Go package] - # Build and install everything, or only the specified package.
 	go install -v $(PACKAGE)
 .PHONY: install
```
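
For context, the added `help` target follows the common self-documenting-Makefile pattern: `grep` collects every target in `$(MAKEFILE_LIST)` whose trailing comment has the form `name: # description`, and `awk` prints each target name padded to 30 characters followed by its description. A rough sketch of the output for the targets visible in this diff (colors omitted, exact wording and spacing approximate):

```plain
$ make help
help                           Show this help message.
install                        Build and install everything, or only the specified package.
```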

README.md

Lines changed: 114 additions & 1 deletion

@@ -1 +1,114 @@

The previous one-line README (`# eval-dev-quality`) is replaced by the following new content:

# DevQualityEval

An evaluation benchmark 📈 and framework for LLMs and friends to compare and evolve the quality of code generation.

In recent years, AI has made great advances in all areas, from art to enhancing productivity in real-world tasks. One of the most notable technologies behind these advancements is the [Large Language Model (LLM)](https://en.wikipedia.org/wiki/Large_language_model). LLMs are useful in many areas, but they appear to be especially useful for generating source code. However, benchmarks for code generation have been lacking, and benchmarks for general software development tasks are almost non-existent.

This repository gives developers of LLMs (and other code generation tools) a standardized benchmark and framework to improve real-world usage in the software development domain, and provides users of LLMs with metrics to check whether a given LLM is useful for their tasks.

## Installation

[Install Git](https://git-scm.com/downloads), [install Go](https://go.dev/doc/install), and then execute the following commands:

```bash
git clone https://github.com/symflower/eval-dev-quality.git
cd eval-dev-quality
go install -v github.com/symflower/eval-dev-quality/cmd/eval-dev-quality
```

You can now use the `eval-dev-quality` binary to [execute the benchmark](#usage).
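
If the `eval-dev-quality` command is not found after installation, the Go binary directory is most likely missing from your `PATH`. A minimal sanity check, assuming the default Go setup where `go install` places binaries in `$(go env GOPATH)/bin`:

```bash
# go install puts binaries into $(go env GOPATH)/bin (usually ~/go/bin).
export PATH="$(go env GOPATH)/bin:$PATH"

# Verify that the binary is reachable.
eval-dev-quality --help
```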

## Usage

> REMARK This project does not currently implement a sandbox for executing code. Make sure that you run benchmarks only inside a sandbox, e.g. at least a container.

At the moment, the only LLM provider implemented is [openrouter.ai](https://openrouter.ai/). You need to create an [access key](https://openrouter.ai/keys) and save it in an environment variable:

```bash
export PROVIDER_TOKEN=openrouter:${your-key}
```

Then you can run all benchmark tasks on all models and repositories:

```bash
eval-dev-quality evaluate
```

The output of the command is a detailed log of all requests to and responses from the models, and of all commands that were executed. After the run, the final result is saved to the file `evaluation.csv`.

See `eval-dev-quality --help`, and especially `eval-dev-quality evaluate --help`, for the available options.

### Example usage: Evaluate only one or more models

If you want to evaluate only one or more specific models, use the `--model` option to select a model. The option can be repeated to select as many models as you want (see the multi-model example at the end of this section).

Executing the following command:

```bash
eval-dev-quality evaluate --model=openrouter/anthropic/claude-3-opus
```

should return an evaluation log like this:

````plain
2024/04/04 13:16:12 Checking that models and languages can be used for evaluation
2024/04/04 13:16:12 Evaluating model "openrouter/anthropic/claude-3-opus" using language "golang" and repository "golang/plain"
2024/04/04 13:16:15 Model "openrouter/anthropic/claude-3-opus" responded to query Given the following Go code file "plain.go" with package "plain", provide a test file for this code.
The tests should produce 100 percent code coverage and must compile.
The response must contain only the test code and nothing else.

```golang
package plain

func plain() {
	return // This does not do anything but it gives us a line to cover.
}
```
with: Here's the test file for the given Go code that provides 100 percent code coverage:

```golang
package plain

import "testing"

func Test_plain(t *testing.T) {
	plain()
}
```
2024/04/04 13:16:15 $ gotestsum --format standard-verbose --hide-summary skipped -- -cover -v -vet=off ./...
=== RUN Test_plain
--- PASS: Test_plain (0.00s)
PASS
coverage: 100.0% of statements
ok plain 0.001s coverage: 100.0% of statements

DONE 1 tests in 0.169s
2024/04/04 13:16:15 Evaluated model "openrouter/anthropic/claude-3-opus" using language "golang" and repository "golang/plain": encountered 0 problems
2024/04/04 13:16:15 Evaluating models and languages
2024/04/04 13:16:15 Evaluation score for "openrouter/anthropic/claude-3-opus": #executed=100.0%(1/1), #problems=0.0%(0/1), average statement coverage=100.0%
````

The execution by default also creates an evaluation file `evaluation.csv` that contains:

```
model,files-total,files-executed,files-problems,coverage-statement
openrouter/anthropic/claude-3-opus,1,1,0,100
```
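
Because the `--model` option can be repeated, several models can also be evaluated and compared in a single run. A minimal sketch (the second model name is purely illustrative; substitute any model identifier available on openrouter.ai):

```bash
eval-dev-quality evaluate \
	--model=openrouter/anthropic/claude-3-opus \
	--model=openrouter/openai/gpt-4-turbo
```

The resulting `evaluation.csv` should then contain one row per evaluated model, which makes it easy to compare scores side by side.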

## How to extend the benchmark?

If you want to add new files to existing language repositories or new repositories to existing languages, [install the evaluation binary](#installation) of this repository and you are good to go.
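
As a rough, hypothetical sketch of what such an addition could look like, here is a small Go file in the spirit of the `golang/plain` repository from the example log above. The file name and its exact location inside the repository are assumptions; check the existing language repositories in this project for the actual layout:

```golang
// even.go - hypothetical additional file for a Go case repository.
package plain

// isEven gives a model a few branches that generated tests have to cover.
func isEven(i int) bool {
	if i%2 == 0 {
		return true
	}

	return false
}
```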

To add new tasks to the benchmark, add features, or fix bugs, you need a development environment. It comes with this repository and can be set up by executing `make install-all`. Then you can run `make` to see the documentation for all available commands.

## How to contribute?

First of all, thank you for thinking about contributing! There are multiple ways to contribute:

- Add more files to existing language repositories.
- Add more repositories to languages.
- Implement another language and add repositories for it.
- Implement new tasks for existing languages and repositories.
- Add more features and fix bugs in the evaluation, development environment, or CI: [best to have a look at the list of issues](https://github.com/symflower/eval-dev-quality/issues).

If you want to contribute but are unsure how: [create a discussion](https://github.com/symflower/eval-dev-quality/discussions) or write us directly at [markus.zimmermann@symflower.com](mailto:markus.zimmermann@symflower.com).
