Welcome to BenchWeaver! 🎉🔬 This Python project provides an automated multilingual benchmarking pipeline, supporting a wide range of models and benchmarks. ⚙️🔧📈
This is the official repository of the master's thesis:
"BenchWeaver: An Automated Multilingual Evaluation Framework with LLM-as-Judge and Translation Prompting for Low-Resource Languages"
The pipeline overview is shown below:

> [!WARNING]
> Some benchmarks have custom settings; check the configs or the supported benchmarks for more details.
Create a new conda environment and install the package:
```bash
conda create --name BenchWeaver python=3.11 -y
conda activate BenchWeaver
pip install unbabel-comet
pip install -e .
```

> [!WARNING]
> The `unbabel-comet` package must be installed before running `pip install -e .`.
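To confirm the dependency order worked, a quick import check can help (a minimal sanity check, not part of the official setup):

```python
# Sanity check that unbabel-comet is importable after installation.
# Not part of the official setup instructions.
from comet import download_model, load_from_checkpoint  # noqa: F401

print("unbabel-comet installed; COMET checkpoints can be fetched on demand.")
```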
> [!NOTE]
> After installation, remember to create `env/tokens.env` to store your Hugging Face and OpenAI/Azure configuration.
For example:
```env
HF_TOKEN=""
AZURE_ENDPOINT_URL=""
AZURE_OPENAI_API_KEY=""
AZURE_API_VERSION=""
OPENAI_API_KEY=""
OPENAI_ORGANIZATION=""
OPENAI_PROJECT=""
```
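These variables are consumed at runtime. As a minimal sketch of how such a file can be loaded and used, assuming `python-dotenv` and the `openai` v1 client (whether BenchWeaver uses python-dotenv internally is an assumption):

```python
# Minimal sketch: loading env/tokens.env and building an Azure OpenAI client.
# Assumes `pip install python-dotenv openai`; whether BenchWeaver itself
# loads the file this way is an assumption.
import os

from dotenv import load_dotenv
from openai import AzureOpenAI

load_dotenv("env/tokens.env")  # populate os.environ from the file

# The HF token gates access to private/gated Hugging Face models and datasets.
if not os.environ.get("HF_TOKEN"):
    print("Warning: HF_TOKEN is empty; gated Hugging Face resources will fail.")

# Example: building an Azure OpenAI client (e.g., for LLM-as-Judge calls)
# from the same variables.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_ENDPOINT_URL"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_API_VERSION"],
)
```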
Access detailed documentation through these links:
| Component | Description | Link |
|---|---|---|
| CLI | Command-line interface guide | CLI |
| Config | Evaluation configuration details | Config |
| Evaluation | Methods and metrics explanation | Evaluation Method |
| Benchmarks | List of supported benchmarks | Support Benchmark |
| Benchmark Type | List of benchmark types | Benchmark Classification |
| Add Benchmark | How to add a new benchmark | Add Benchmark |
| Problem Record | Record of problems encountered | Problem Record |
| Model Usage Method | How to load LLMs in different modes | Model Usage Method |
To execute the pipeline, run the configurations for each experiment as listed below:
| Chapter | Experiment/Detail | Configuration Link |
|---|---|---|
| Main Result | - | Main Result |
| Ablation Study | Translation Prompt | Translation Prompt |
| Ablation Study | Compare with P-MMEval | Compare with P-MMEval |
To check the translation quality, you can run the following script to reproduce the results:

```bash
bash scripts/bash/eval_trans.sh
```
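Under the hood, translation quality is scored with COMET via the `unbabel-comet` dependency installed earlier. A minimal sketch of such scoring, assuming the standard `unbabel-comet` API and the common `Unbabel/wmt22-comet-da` checkpoint (whether the script uses this exact model is an assumption):

```python
# Minimal sketch of COMET scoring; not necessarily what eval_trans.sh does.
from comet import download_model, load_from_checkpoint

# "Unbabel/wmt22-comet-da" is a common reference-based COMET checkpoint
# (an assumption here, not confirmed by the repo).
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each sample needs the source, the machine translation, and a reference.
data = [
    {
        "src": "Das ist ein Test.",
        "mt": "This is a test.",
        "ref": "This is a test.",
    }
]

output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 -> run on CPU
print(output.scores)        # per-segment quality scores
print(output.system_score)  # corpus-level average
```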
Recent and planned updates:
- Allow using a local deployment model endpoint. (Gemini models are now supported as well!)
- Allow using Hugging Face Hub datasets.
- The datasets in `evaluation_data` will be uploaded to my Hugging Face account in the future.
  - Newer versions of the datasets will not appear in `evaluation_data`.
- Update the documentation.
  - Update config_doc.md
  - Update supported_benchmark.md
Since the thesis has not yet been included in the National Digital Library of Theses and Dissertations in Taiwan, the following temporary citation format is provided:
```bibtex
@misc{benchweaver,
  title        = {BenchWeaver: An Automated Multilingual Evaluation Framework with LLM-as-Judge and Translation Prompting for Low-Resource Languages},
  author       = {梁致銓 (Joey Liang)},
  howpublished = {\url{https://github.com/joeyliang1024/BenchWeaver}},
  year         = {2025}
}
```