Baseline benchmark for 17 coding models : r/LocalLLaMA #456
Labels

- code-generation: code generation models and tools like Copilot and aider
- llm-benchmarks: testing and benchmarking large language models
- llm-evaluation: evaluating the performance and behavior of large language models through human-written evaluation sets
- llm-experiments: experiments with large language models
- MachineLearning: ML models, training, and inference
- prompt-engineering: developing and optimizing prompts to efficiently use language models for various applications and research topics
Baseline Benchmark for 17 Coding Models
Discussion
I am currently working on implementing some ideas for coding-model inference strategies (prompting, control, context exploration, CoT, ToT, etc.), and I needed a baseline benchmark across a bunch of models. Since I work on a 3060 12GB, I was limited in what I could test, so I went for every 7B/13B model that has an AWQ quant available, since that is what the inference library I use supports. I thought I'd share some numbers.
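(For the VRAM arithmetic, assuming roughly 4 bits per weight for AWQ: a 13B model needs about 13B × 0.5 bytes ≈ 6.5 GB for weights, which leaves headroom for activations and the KV cache on a 12 GB card, whereas an unquantized fp16 13B model at ~26 GB wouldn't come close to fitting.)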
Notes:

- The prompt template used for every question (a Python f-string):

```python
f"""<s>You are a helpful and respectful assistant. Answer the following question: {question}"""
```
Results
I've plotted the results (with horrendous contrasting colors, but alas) to look for any interesting patterns in problem solving. You can find the plots here.
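For anyone wanting to produce a similar plot, a rough sketch is below; the pass/fail matrix is fabricated, since the actual per-problem scores live in the linked plots rather than in this post:

```python
# Hypothetical plotting sketch: the model names and pass/fail matrix are
# made up, as the per-problem results themselves aren't included here.
import numpy as np
import matplotlib.pyplot as plt

models = ["model-a-7b", "model-b-7b", "model-c-13b"]   # placeholders
results = np.random.rand(len(models), 20) > 0.5        # fake pass/fail data

fig, ax = plt.subplots(figsize=(10, 3))
ax.imshow(results, cmap="Greys", aspect="auto")        # dark = pass, light = fail
ax.set_yticks(range(len(models)), labels=models)
ax.set_xlabel("problem index")
ax.set_title("Per-problem pass/fail across models")
plt.tight_layout()
plt.show()
```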
Suggested labels

```json
{
  "label-name": "coding-models",
  "description": "Discussion and benchmark of coding models implementation strategies.",
  "confidence": 96.82
}
```