Welcome to BenchWeaver! 🎉🔬 This Python project provides an automated multilingual benchmarking pipeline, supporting a wide range of models and benchmarks. ⚙️🔧📈
This is the official repository of the master's thesis:
"BenchWeaver: An Automated Multilingual Evaluation Framework with LLM-as-Judge and Translation Prompting for Low-Resource Languages"
The pipeline overview is shown below:

> [!WARNING]
> Some benchmarks have custom settings; check the configs or the supported benchmarks for more details.
Create a new conda environment and install the package:
```bash
conda create --name BenchWeaver python=3.11 -y
conda activate BenchWeaver
pip install unbabel-comet
pip install -e .
```

> [!WARNING]
> The `unbabel-comet` package must be installed before running `pip install -e .`.
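To confirm the dependency order worked, a quick import check can help (a minimal sanity check, not part of the official setup):

```python
# Sanity check that unbabel-comet is importable after installation.
# Not part of the official setup instructions.
from comet import download_model, load_from_checkpoint  # noqa: F401

print("unbabel-comet installed; COMET checkpoints can be fetched on demand.")
```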
> [!NOTE]
> After installation, remember to create `env/tokens.env` to store your Hugging Face and OpenAI/Azure configuration.
For example:
```env
HF_TOKEN=""
AZURE_ENDPOINT_URL=""
AZURE_OPENAI_API_KEY=""
AZURE_API_VERSION=""
OPENAI_API_KEY=""
OPENAI_ORGANIZATION=""
OPENAI_PROJECT=""
```
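These variables are consumed at runtime. As a minimal sketch of how such a file can be loaded and used, assuming `python-dotenv` and the `openai` v1 client (whether BenchWeaver uses python-dotenv internally is an assumption):

```python
# Minimal sketch: loading env/tokens.env and building an Azure OpenAI client.
# Assumes `pip install python-dotenv openai`; whether BenchWeaver itself
# loads the file this way is an assumption.
import os

from dotenv import load_dotenv
from openai import AzureOpenAI

load_dotenv("env/tokens.env")  # populate os.environ from the file

# The HF token gates access to private/gated Hugging Face models and datasets.
if not os.environ.get("HF_TOKEN"):
    print("Warning: HF_TOKEN is empty; gated Hugging Face resources will fail.")

# Example: building an Azure OpenAI client (e.g., for LLM-as-Judge calls)
# from the same variables.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_ENDPOINT_URL"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_API_VERSION"],
)
```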
Access detailed documentation through these links:
| Component | Description | Link |
|---|---|---|
| CLI | Command-line interface guide | CLI |
| Config | Evaluation configuration details | Config |
| Evaluation | Methods and metrics explanation | Evaluation Method |
| Benchmarks | List of supported benchmarks | Support Benchmark |
| Benchmark Type | List of benchmark types | Benchmark Classification |
| Add Benchmark | How to add a new benchmark | Add Benchmark |
| Problem Record | Record of problems encountered | Problem Record |
| Model Usage Method | How to load LLMs in different modes | Model Usage Method |
To execute the pipeline, run the configurations for each experiment as listed below:
| Chapter | Experiment/Detail | Configuration Link |
|---|---|---|
| Main Result | - | Main Result |
| Ablation Study | Translation Prompt | Translation Prompt |
| Ablation Study | Compare with P-MMEval | Compare with P-MMEval |
To check the translation quality, you can run the following script to reproduce the results:

```bash
bash scripts/bash/eval_trans.sh
```
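Under the hood, translation quality is scored with COMET via the `unbabel-comet` dependency installed earlier. A minimal sketch of such scoring, assuming the standard `unbabel-comet` API and the common `Unbabel/wmt22-comet-da` checkpoint (whether the script uses this exact model is an assumption):

```python
# Minimal sketch of COMET scoring; not necessarily what eval_trans.sh does.
from comet import download_model, load_from_checkpoint

# "Unbabel/wmt22-comet-da" is a common reference-based COMET checkpoint
# (an assumption here, not confirmed by the repo).
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each sample needs the source, the machine translation, and a reference.
data = [
    {
        "src": "Das ist ein Test.",
        "mt": "This is a test.",
        "ref": "This is a test.",
    }
]

output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 -> run on CPU
print(output.scores)        # per-segment quality scores
print(output.system_score)  # corpus-level average
```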
Recent and planned updates:
- Allow using a local deployment model endpoint. (Gemini models are now supported as well!)
- Allow using Hugging Face Hub datasets.
- The datasets in `evaluation_data` will be uploaded to my Hugging Face account in the future.
  - Newer versions of the datasets will not appear in `evaluation_data`.
- Update the documentation.
  - Update config_doc.md
  - Update supported_benchmark.md
Since the thesis has not yet been included in the National Digital Library of Theses and Dissertations in Taiwan, the following temporary citation format is provided:
```bibtex
@misc{benchweaver,
  title        = {BenchWeaver: An Automated Multilingual Evaluation Framework with LLM-as-Judge and Translation Prompting for Low-Resource Languages},
  author       = {梁致銓 (Joey Liang)},
  howpublished = {\url{https://github.com/joeyliang1024/BenchWeaver}},
  year         = {2025}
}
```