MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
MultiChallenge is a novel benchmark designed to evaluate large language models (LLMs) on their ability to handle multi-turn conversations with human users—an essential but underexplored capability for their real-world applications. MultiChallenge focuses on four key categories of challenges that are common, realistic, and highly demanding in current human-LLM interactions. These challenges require LLMs to excel simultaneously in accurate context allocation, in-context reasoning, and instruction-following.
- `data/`: Contains input files for conversations (`benchmark_questions.jsonl`) and optional model response files (in `final_model_responses/`) used in the benchmark.
- `results/`: Stores the benchmark's output, including evaluation scores and metrics, saved to `evaluation_results.txt`.
- `src/`: Core functionality for the benchmark.
  - `models/`: Houses the model provider classes.
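For orientation, a model provider class might look roughly like the minimal sketch below; the class and method names (`BaseProvider`, `generate`) are illustrative assumptions, not the repository's actual interface.

```python
# Illustrative sketch only; the real provider interface in src/models/ may differ.
from abc import ABC, abstractmethod


class BaseProvider(ABC):
    """Hypothetical base class for model providers (e.g. OpenAI, Hugging Face)."""

    def __init__(self, **provider_args):
        # Corresponds loosely to the --provider-args key=value pairs, e.g. model=gpt-4o temp=0.
        self.provider_args = provider_args

    @abstractmethod
    def generate(self, conversation: list[dict]) -> str:
        """Return the model's next response for a multi-turn conversation."""
```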
- Clone the Repository

  ```bash
  git clone some_directory
  cd multi-challenge
  ```

- Install Requirements

  ```bash
  pip install -r requirements.txt
  ```

- Create `.env` File

  Create a `.env` file in the root directory with your API keys. For example:

  ```
  OPENAI_API_KEY=your-openai-api-key   # REQUIRED
  HUGGINGFACE_TOKEN=your-huggingface-token
  ```
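To confirm the keys are visible to Python before running the benchmark, a quick check such as the one below can help; it assumes the `python-dotenv` package, which may differ from how the benchmark itself loads the `.env` file.

```python
# Minimal sanity check; assumes python-dotenv, which the benchmark itself may not use.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is missing from the environment"
```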
If you already have model responses and want to evaluate them:
```bash
python main.py --responses-file data/model_responses.jsonl --output-file results/evaluation_results.txt
```

Make sure to format the responses file as shown in `data/responses_template.jsonl`.
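The authoritative format is whatever `data/responses_template.jsonl` shows; the sketch below only illustrates writing a JSONL file of responses, and its field names (`question_id`, `response`) are hypothetical placeholders.

```python
# Illustrative only: copy the real field names from data/responses_template.jsonl,
# which may differ from the placeholder keys used here.
import json

responses = [
    {"question_id": 1, "response": "Model answer for conversation 1..."},
    {"question_id": 2, "response": "Model answer for conversation 2..."},
]

with open("data/model_responses.jsonl", "w", encoding="utf-8") as f:
    for row in responses:
        f.write(json.dumps(row) + "\n")
```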
To dynamically generate responses using a supported model provider:
```bash
python main.py --model-provider openai --provider-args model=gpt-4o temp=0 --output-file results/evaluation_results.txt
```

To evaluate model performance with multiple attempts per conversation:
```bash
python main.py --model-provider openai --attempts 3 --output-file results/evaluation_results.txt
```

This will generate 3 responses per conversation and consider it successful if any attempt passes.
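The any-attempt-passes rule amounts to a simple disjunction over per-attempt verdicts; the snippet below is a sketch of that idea, not the benchmark's actual scoring code.

```python
# Sketch of the "successful if any attempt passes" rule; not the benchmark's own code.
def conversation_passes(attempt_verdicts: list[bool]) -> bool:
    """A conversation counts as passed if at least one attempt passed."""
    return any(attempt_verdicts)

print(conversation_passes([False, True, False]))  # True: one of three attempts passed
```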
To save comprehensive evaluation details including all responses and judgments:
```bash
python main.py --model-provider openai --attempts 3 --output-file results/evaluation_results.txt --raw results/detailed_results.csv
```

Command-line arguments:

- `--output-file`: Path to save the final evaluation results.
- `--responses-file`: Path to a file containing pre-generated responses. (OPTIONAL)
- `--model-provider`: The model provider for generating responses (`huggingface`, `openai`, etc.).
- `--provider-args`: Model-specific arguments in `key=value` format (e.g., `model_path=/path/to/model`).
- `--attempts`: Number of attempts to generate for each conversation. Defaults to 1.
- `--max-workers_response_gen`: Number of concurrent workers to multi-thread response generation. Defaults to 1.
- `--max-workers_eval`: Number of concurrent workers to multi-thread response evaluation. Defaults to 1.
- `--raw`: Path to save detailed raw output including all responses and evaluations. (OPTIONAL)
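For orientation only, an `argparse` declaration mirroring these flags might look roughly like the sketch below; it is an approximation, not the actual contents of `main.py`.

```python
# Approximation of how main.py might declare these flags; the real CLI code may differ.
import argparse

parser = argparse.ArgumentParser(description="Run the MultiChallenge evaluation")
parser.add_argument("--output-file", help="Path to save the final evaluation results")
parser.add_argument("--responses-file", help="Optional path to pre-generated responses (JSONL)")
parser.add_argument("--model-provider", help="Provider used to generate responses, e.g. openai")
parser.add_argument("--provider-args", nargs="*", default=[], help="key=value provider settings")
parser.add_argument("--attempts", type=int, default=1, help="Attempts per conversation")
parser.add_argument("--max-workers_response_gen", type=int, default=1, help="Threads for response generation")
parser.add_argument("--max-workers_eval", type=int, default=1, help="Threads for evaluation")
parser.add_argument("--raw", help="Optional path for detailed raw output")
args = parser.parse_args()
```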
The evaluation results include:

- In `evaluation_results.txt`:
  - Overall Score: Percentage of conversations where at least one attempt meets the criteria
  - Axis Scores: Per-axis scores based on the number of attempts
- In `detailed_results.txt` (if `--raw` is specified):
  - Complete conversation history
  - All model responses for each attempt
  - The judge's verdicts and reasoning
  - Expected pass criteria
  - Per-conversation pass/fail statistics
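As a rough illustration of how the Overall Score and Axis Scores above relate to per-conversation pass/fail records, the snippet below aggregates hypothetical results; the record fields and axis names are placeholders, not the benchmark's implementation.

```python
# Hypothetical aggregation sketch; record fields and axis names are placeholders.
from collections import defaultdict

# One record per conversation: its challenge axis and whether any attempt passed.
records = [
    {"axis": "axis-A", "passed": True},
    {"axis": "axis-A", "passed": False},
    {"axis": "axis-B", "passed": True},
]

overall_score = 100 * sum(r["passed"] for r in records) / len(records)

by_axis = defaultdict(list)
for r in records:
    by_axis[r["axis"]].append(r["passed"])
axis_scores = {axis: 100 * sum(flags) / len(flags) for axis, flags in by_axis.items()}

print(f"Overall Score: {overall_score:.1f}%")  # 66.7%
print(axis_scores)                             # {'axis-A': 50.0, 'axis-B': 100.0}
```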