BenchWeaver 🏆🚀🐍

Welcome to BenchWeaver! 🎉🔬 This Python project provides a specialized benchmarking pipeline, supporting various models and benchmarks. ⚙️🔧📈

This is the official repository of the master's thesis:

"BenchWeaver: An Automated Multilingual Evaluation Framework with LLM-as-Judge and Translation Prompting for Low-Resource Languages"

Pipeline Overview

The pipeline overview is shown below (figure: Pipeline Overview).

Warning

Some benchmarks have custom settings; check the configs or the supported benchmarks documentation for more details.

Installation 💻⚡

Create a new conda environment and install the package:

conda create --name BenchWeaver python=3.11 -y
conda activate BenchWeaver
pip install unbabel-comet
pip install -e .

Warning

The unbabel-comet package must be installed before running pip install -e .
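
As a quick sanity check that unbabel-comet installed correctly (an assumed verification step, not part of the official instructions), the package's core entry points can be imported directly:

# Verify that the unbabel-comet package imports correctly (assumed check, not part of the official setup)
from comet import download_model, load_from_checkpoint

print("unbabel-comet import OK")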

Note

After installation, remember to create env/tokens.env to store your Hugging Face, OpenAI, and Azure OpenAI credentials.

For example:

HF_TOKEN=""
AZURE_ENDPOINT_URL=""
AZURE_OPENAI_API_KEY=""
AZURE_API_VERSION=""
OPENAI_API_KEY=""
OPENAI_ORGANIZATION=""
OPENAI_PROJECT=""
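
A minimal sketch of how these variables can be read at runtime, assuming python-dotenv is used for loading (BenchWeaver's actual loading mechanism may differ):

import os

from dotenv import load_dotenv  # assumed helper: pip install python-dotenv

# Load the token file created after installation
load_dotenv("env/tokens.env")

print("HF token set:", bool(os.getenv("HF_TOKEN")))
print("OpenAI key set:", bool(os.getenv("OPENAI_API_KEY")))
print("Azure endpoint set:", bool(os.getenv("AZURE_ENDPOINT_URL")))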

Documentation 📚📝

Access detailed documentation through these links:

Component Description Link
CLI Command-line interface guide CLI
Config Evaluation configuration details Config
Evaluation Methods and metrics explanation Evaluation Method
Benchmarks List of supported benchmarks Support Benchmark
Benchmark Type List of benchmark types Benchmark Classification
Add Benchmark How to add a new benchmark Add Benchmark
Problem Record Record of problems encountered Problem Record
Model Usage Method How to load LLMs in different modes Model Usage Method

Reproducibility of Results

For pipeline execution, you can run the configurations for each part as listed below:

Chapter Experiment/Detail Configuration Link
Main Result - Main Result
Ablation Study Translation Prompt Translation Prompt
Ablation Study Compare with P-MMEval Compare with P-MMEval

To check the translation quality, you can execute the following script to reproduce the results:

bash scripts/bash/eval_trans.sh
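
For reference, here is a minimal sketch of how unbabel-comet scores translations, assuming the Unbabel/wmt22-comet-da model; the actual eval_trans.sh script may use a different model or settings:

from comet import download_model, load_from_checkpoint

# Download and load a reference-based COMET model (model choice here is an assumption)
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each item provides the source sentence, the candidate translation, and a reference
data = [
    {
        "src": "這是一個範例句子。",
        "mt": "This is an example sentence.",
        "ref": "This is a sample sentence.",
    }
]

# gpus=0 runs on CPU; raise batch_size and gpus as resources allow
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # per-segment quality scores
print(output.system_score)  # corpus-level score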

Enhancement

Reference

Since the thesis has not yet been included in the National Digital Library of Theses and Dissertations in Taiwan, the following temporary citation format is provided:

@misc{benchweaver,
    title        = {BenchWeaver: An Automated Multilingual Evaluation Framework with LLM-as-Judge and Translation Prompting for Low-Resource Languages},
    author       = {梁致銓 (Joey Liang)},
    howpublished = {\url{https://github.com/joeyliang1024/BenchWeaver}},
    year         = {2025}
}
