This repository contains a Python implementation of a multi-turn evaluation benchmark for large language models (LLMs). The benchmark evaluates how well LLMs follow instructions across multiple turns of conversation in a multilingual setting.
The repository contains the following files:

- `api_client.py`: the interface for interacting with LLMs via API calls.
- `ifeval.py`: the implementation of the Instruction Following Evaluation (IFEval) metric, which measures how well an LLM follows natural language instructions.
- `metrics.py`: the metrics used to compute IFEval scores, plus data preprocessing and enrichment for multi-turn instructions.
- `utils.py`: utility functions used throughout the framework, e.g., `GenerationSetting`, `get_inference_batch()`, and `preprocess_data`.
- `multi_turn_instruct_following_eval_api.py`: the main entry point for running the multi-turn evaluation benchmark via API calls.
- `multi_turn_instruct_following_eval_vllm.py`: the main entry point for running the multi-turn evaluation benchmark on local GPUs.
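To give a rough sense of how these pieces fit together, below is a minimal, hypothetical sketch of a multi-turn evaluation loop: the conversation history grows with each turn, and every response is checked against that turn's instructions. The names `generate` and `check_instructions` are illustrative stand-ins, not the actual interfaces exposed by `api_client.py` or `ifeval.py`.

```python
# Illustrative sketch only -- the repository's actual interfaces may differ.

def generate(conversation):
    # Hypothetical stand-in for a call into api_client.py or a local vLLM model.
    return "model response"

def check_instructions(response, instruction_checks):
    # Hypothetical stand-in for the IFEval-style constraint checks in ifeval.py.
    return all(check(response) for check in instruction_checks)

def evaluate_example(turns):
    """turns: list of (user_prompt, [instruction_check, ...]) pairs."""
    conversation, per_turn_results = [], []
    for prompt, instruction_checks in turns:
        conversation.append({"role": "user", "content": prompt})
        response = generate(conversation)
        conversation.append({"role": "assistant", "content": response})
        per_turn_results.append(check_instructions(response, instruction_checks))
    return per_turn_results  # pass/fail per turn
```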
To use the multi-turn evaluation benchmark, follow these steps:
- Clone the repository:

```bash
git clone https://github.com/your-username/multi-turn-evaluation.git
```
- Install the required dependencies:

```bash
cd multi_turn_eval
pip install -r requirements.txt
```
- Download the data from Hugging Face:

```bash
git clone https://huggingface.co/datasets/facebook/Multi-IF data/Multi-IF
```
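Before running the evaluation, you can optionally sanity-check the downloaded data, for example with pandas. The snippet below only inspects the file's size and column names; it does not assume any particular schema:

```python
import pandas as pd

# Load the Multi-IF data and take a quick look at its size and columns.
df = pd.read_csv("data/Multi-IF/multiIF_20241018.csv")
print(df.shape)              # number of examples and columns
print(df.columns.tolist())   # column names
print(df.head(3))            # first few rows
```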
- Run the main function in `multi_turn_instruct_following_eval_vllm.py`:

```bash
python multi_turn_instruct_following_eval_vllm.py \
    --model_path <MODEL_PATH> \
    --tokenizer_path <TOKENIZER_PATH> \
    --input_data_csv <INPUT_DATA_CSV> \
    --batch_size <BATCH_SIZE> \
    --tensor_parallel_size <TENSOR_PARALLEL_SIZE>
```
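The flags above correspond to standard vLLM setup arguments. As a rough sketch of how they map onto vLLM (the script's actual prompt formatting and multi-turn logic live in the repository; nothing below is its exact implementation):

```python
from vllm import LLM, SamplingParams

# --model_path, --tokenizer_path, and --tensor_parallel_size map onto these arguments.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",       # --model_path
    tokenizer="meta-llama/Llama-3.1-70B-Instruct",   # --tokenizer_path
    tensor_parallel_size=8,                          # --tensor_parallel_size
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)

# In the benchmark, prompts come from the input CSV and are batched per --batch_size;
# here a single placeholder prompt stands in for a formatted multi-turn example.
prompts = ["<formatted multi-turn prompt>"]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```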
This will execute the multi-turn evaluation benchmark, print the results to the console, and save intermediate generation results to CSV files.
For example, to evaluate Meta-Llama-3.1-70B-Instruct:

```bash
python multi_turn_instruct_following_eval_vllm.py \
    --model_path meta-llama/Llama-3.1-70B-Instruct \
    --tokenizer_path meta-llama/Llama-3.1-70B-Instruct \
    --input_data_csv data/Multi-IF/multiIF_20241018.csv \
    --batch_size 4 \
    --tensor_parallel_size 8
```
Or, to run the evaluation via API:

- Run the main function in `multi_turn_instruct_following_eval_api.py` with `claude-3.5-sonnet-20240620`:

```bash
python multi_turn_instruct_following_eval_api.py \
    --max_workers 5 \
    --api_model_name claude-3.5-sonnet-20240620 \
    --input_data_csv data/Multi-IF/multiIF_20241018.csv \
    --max_new_tokens 1024 \
    --steps 1 2 3
```
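As a hedged sketch of what an API-based multi-turn run involves (not necessarily how `api_client.py` is implemented), each example's turns are replayed against the API while carrying the conversation history forward, and `--max_workers` examples are processed concurrently. The snippet below uses the Anthropic Python client directly, assumes `ANTHROPIC_API_KEY` is set in the environment, and uses made-up example prompts:

```python
from concurrent.futures import ThreadPoolExecutor

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_turns(prompts, model="claude-3.5-sonnet-20240620", max_new_tokens=1024):
    """Replay one example's turns, keeping the conversation history."""
    messages, responses = [], []
    for prompt in prompts:
        messages.append({"role": "user", "content": prompt})
        reply = client.messages.create(
            model=model,              # --api_model_name
            max_tokens=max_new_tokens,  # --max_new_tokens
            messages=messages,
        )
        text = reply.content[0].text
        messages.append({"role": "assistant", "content": text})
        responses.append(text)
    return responses

# Hypothetical examples; in the benchmark these come from the input CSV.
examples = [["Write a haiku about rain.", "Now rewrite it in German."]]

# --max_workers controls how many examples are evaluated in parallel.
with ThreadPoolExecutor(max_workers=5) as pool:
    all_responses = list(pool.map(run_turns, examples))
```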
If you use the code or benchmark, please consider citing the following paper:
```bibtex
@article{he2024multi,
  title={Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following},
  author={He, Yun and Jin, Di and Wang, Chaoqi and Bi, Chloe and Mandyam, Karishma and Zhang, Hejia and Zhu, Chen and Li, Ning and Xu, Tengyu and Lv, Hongjiang and others},
  journal={arXiv preprint arXiv:2410.15553},
  year={2024}
}
```
We welcome contributions to this repository! If you have any suggestions or improvements, please open an issue or submit a pull request.
This project is licensed under the Apache License, Version 2.0 (the "License"). See the LICENSE file for details.