- Interview questions written by humans, test taken by AI
- Inference scripts for all common API providers and CUDA-enabled quantization runtimes
- Sandbox environment (Docker-based) for untrusted Python and NodeJS code validation
- Evaluate effects of prompting techniques and sampling parameters on LLM coding performance
- Evaluate LLM coding performance degradation due to quantization
11/13 Evaluate Qwen2.5 (32B at FP16, GGUF Q8, EXL2 8bpw), OpenCoder (1.5B and 8B at FP16).
10/26 Evaluate Qwen2.5 (3B, 7B, 14B FP16 | 14B, 32B, 72B AWQ) and Qwen-Coder2.5
10/26 Update evaluations of all available OpenAI, Mistral and Anthropic models.
10/25 Evaluate ibm-granite/granite-3.0 family (8B Dense, 2B Dense, 1B MoE, 3B MoE). Had to take a brief hiatus due to switching jobs, but now working to catch up on the backlog, so open Issues if there are any interesting code models or new families from the last ~6 weeks that I missed! Qwen2.5 and Llama3.2 will be up this weekend.
9/12 Fixed a serialization bug in the evaluator which negatively affected four results: deepseek-ai-DeepSeek-Coder-V2-Lite-Instruct-fp16, ibm-granite-granite-8b-code-instruct-nf4, ajibawa-2023-Code-Llama-3-8B, ollama-phi3:3.8b-mini-instruct-4k-fp16
9/11 Evaluate Yi-Coder-1.5B-Chat and Yi-Coder-9B-Chat (FP16); the 9B in particular is very strong.
`junior-v2` is a multi-language (Python, JavaScript) suite of 12 tests created for this project to test small LLM coding performance. This project provides all necessary components to execute this evaluation.
🚧 `humaneval` is a Python-only suite of 164 tests created by OpenAI. This project provides template scripts to prepare and execute the humaneval interview, as well as result extraction scripts to help their evaluator. See https://github.com/openai/human-eval for more information.
All model answers and evaluation results are now included inside this repository! Install a recent release of Streamlit (`pip install streamlit==1.23`), then run `streamlit run app.py` or `streamlit run compare-app.py` to launch the webapps locally.
🚧 Development work on `humaneval/` is currently paused; there are other projects that are much further along.
See https://github.com/my-other-github-account/llm-humaneval-benchmarks and https://github.com/abacaj/code-eval for large lists of Humaneval LLM benchmark results.
- `junior-v2/*.yaml` - junior coder interview questions (stable)
- `senior/*.yaml` - senior coder interview questions (WIP)
- `prompts/*.txt` - LLM prompt templates for the various models
- `prepare.py` - applies templates to questions, turning them into language- and model-specific prompts suitable for interview

See `prompts/` for all prompts referenced in the leaderboard.

- `params/*.json` - sampling hyper-parameter sets (used by all interview scripts)
- `interview-*.py` - interview scripts

See `params/` for all params referenced in the leaderboard.

- `evaluate.py` - runs tests for the generated code in a sandbox and grades each answer
- `app.py` - Streamlit webapp to explore results, see https://huggingface.co/spaces/mike-ravkine/can-ai-code-results
- `compare.py` - performs comparisons between evaluations, optionally calling out to an LLM for analysis
- `compare-app.py` - Streamlit webapp to explore comparisons, see https://huggingface.co/spaces/mike-ravkine/can-ai-code-compare
- `compare/*.yaml` - compare configurations
- `compare/*.json` - compare results
| API Runtime | Script |
|---|---|
| LiteLLM (OpenAI, etc.) | `interview-litellm.py` |
| OobaBooga/KoboldCpp | `interview-oobabooga.py` |
| Huggingface Inference | `interview-hfinference.py` |
| Gradio (HF Spaces) | `interview-gradio.py` |
| Quantization Type | Script | Dependency |
|---|---|---|
| GGUF | `interview-llamacpp.py` | llamacpp or ggml binary |
| GPTQ (AutoGptQ) | `interview-cuda.py` | auto-gptq==0.6.0 |
| GPTQ (ExLlama) | `interview-cuda.py` | exllama @ 3b013cd53c7d413cf99ca04c7c28dd5c95117c0d |
| EXL2, GPTQ (ExLlama2) | `interview-cuda.py` | exllamav2 @ 0.0.12 |
| HQQ | `interview-cuda.py` | hqq @ 0.1.1 |
| AWQ, FP16 (vLLM) | `interview-cuda.py` | vllm==0.3.0 |
| CTranslate2 | `interview-cuda.py` | ctranslate2>=3.16.0 |
| bitsandbytes | `interview-cuda.py` | bitsandbytes==0.41.3 |
| FP16 (Transformers) | `interview-cuda.py` | transformers==4.37.2 |
The recommended Modal wrapper is `interview_modal_cuda11.py`, which builds a CUDA 11.8-based container with all of the above dependencies working. An `interview_modal_cuda12.py` is also provided, but AutoGPTQ and CTranslate2 are not compatible with it.
Unfortunately the nature of Modal does not allow command-line selection of either the LLM model or the runtime engine.
To select models, open the script and uncomment the `.run_function(download...)` line of your choice. Note that only one model can be selected at a time. To add a new model, implement a new `download...` function.

To select the runtime, open the script and uncomment one of the `RUNTIME` options. Note that for `transformers` you must also specify `QUANT`.
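As a rough illustration of this pattern (a hypothetical sketch, not the actual contents of `interview_modal_cuda11.py`; the function and model names below are made up):

```python
# Hypothetical sketch of the selection pattern described above; the real
# interview_modal_cuda11.py differs. Exactly one download function and one
# RUNTIME option should be left uncommented.
from modal import Image

def download_codellama():
    # Illustrative model-download function; add one like this per model.
    from huggingface_hub import snapshot_download
    snapshot_download("codellama/CodeLlama-7b-Instruct-hf")

image = (
    Image.from_registry("nvidia/cuda:11.8.0-devel-ubuntu22.04", add_python="3.10")
    .pip_install("transformers==4.37.2")
    .run_function(download_codellama)        # model of choice: uncomment exactly one
    # .run_function(download_some_other_model)
)

RUNTIME = "transformers"
QUANT = "fp16"   # QUANT is only required when RUNTIME is transformers
# RUNTIME = "vllm"
```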
A set of interview questions is a folder of .yaml files. Each Question is a top-level key:
```yaml
SanityList:
    Signature: "things()"
    Input: "with no inputs"
    Output: "a list with three values: the number 5, the string 'foobar', the capital city of Spain"
    Fact: "the capital city of Spain is Madrid"
    Description: "List function, see if the model can combine input facts with internal knowledge."
    Checks:
        input_name:
            assert: "f.name"
            eq: "things"
```
In this example `SanityList` is the name of the interview question.

The first four fields are used by `prepare.py` to create the interview:

- `Signature` is the desired function signature
- `Input` describes the function inputs
- `Output` describes the function outputs
- `Fact` is optional and provides any context that is required to correctly perform the task

These 4 variables, along with `language` (either `python` or `javascript`), are used to expand templates in `prompts/`.
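As a minimal sketch of that expansion (illustrative only; the real templates in `prompts/` and the logic in `prepare.py` are more involved), the idea is plain string substitution:

```python
# Illustrative only: the actual prompt templates and prepare.py logic differ.
question = {
    "Signature": "things()",
    "Input": "with no inputs",
    "Output": "a list with three values: the number 5, the string 'foobar', "
              "the capital city of Spain",
    "Fact": "the capital city of Spain is Madrid",
}

# A made-up template using the 4 question fields plus {language}.
template = ("Write a {language} function {Signature} that takes {Input} "
            "and returns {Output}. Note that {Fact}.")

prompt = template.format(language="python", **question)
print(prompt)
```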
The last two fields are used by `evaluate.py` to judge the results:

- `Description` is a human-readable explanation of why this test is useful
- `Checks` defines the expected behavior of the output

Each check has a name, some `assert` value (Python code) and an expected `eq` value.
The `f` object represents the sandbox view of the function. Static analysis is performed on the function signature to extract the `f.name` and `f.args` fields, while `f.call` allows for function evaluation.
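As a rough sketch of how such a check could be graded (illustrative only; the real `evaluate.py` and its Docker sandbox differ), assuming `f` wraps the generated function:

```python
# Illustrative sketch; not the actual evaluate.py implementation.
class SandboxFunction:
    """Stand-in for the sandbox view 'f' of a generated function."""
    def __init__(self, name, args, fn):
        self.name = name      # extracted by static analysis of the signature
        self.args = args
        self._fn = fn

    def call(self, *args, **kwargs):
        # In the real project this executes inside the sandbox.
        return self._fn(*args, **kwargs)

def run_check(f, check):
    # Evaluate the check's 'assert' expression with f in scope,
    # then compare the result against the expected 'eq' value.
    actual = eval(check["assert"], {"f": f})
    return actual == check["eq"]

# Grading the input_name check from the example question above:
f = SandboxFunction("things", [], lambda: [5, "foobar", "Madrid"])
print(run_check(f, {"assert": "f.name", "eq": "things"}))  # True
```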
All scripts output automatically named .ndjson files to the `results/` directory.

Each stage outputs a super-set of the fields from the stage before it, so it's possible to feed eval/interview back to interview (to re-run the questions) or back to eval (to re-run the eval).
`results/prepare_{interview}_{languages}_{template}.ndjson`
Fields:
- all Question fields (Signature, Input, Output, Fact, Description)
- name
- language
- prompt
`results/interview_{interview}_{languages}_{template}_{templateout}_{params}_{model}_{timestamp}.ndjson`

Fields:
- all `prepare` fields
- model
- params
- answer
- runtime
`results/eval_{interview}_{languages}_{template}_{templateout}_{params}_{model}_{timestamp}.ndjson`

Fields:
- all `interview` fields
- status
- passed
- total
- checks
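For example, a minimal sketch of consuming an eval file (the filename below is made up; the field names are the ones listed above):

```python
import json

# Print the pass rate for each answer in an eval results file.
# .ndjson means one JSON object per line; the path here is illustrative.
with open("results/eval_example.ndjson") as fp:
    for line in fp:
        row = json.loads(line)
        print(f"{row['model']} {row['name']} [{row['language']}]: "
              f"{row['passed']}/{row['total']}")
```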