
LLM-Eval


LLM-Eval is a Streamlit-based application for evaluating Large Language Model (LLM) pipelines against predefined or custom datasets using various metrics. It allows you to:

  • Compare model outputs to ground-truth answers using exact match, ROUGE, BLEU, semantic similarity, and LLM-based (GPT-style) criteria.
  • Use live models (local or remote) or precomputed responses.
  • Effortlessly upload your own custom dataset in CSV/JSON format.
  • Use local models served with Ollama or via OpenAI-compatible API endpoints.

Screenshot

Features

  • Multiple Evaluation Methods
    Evaluate your model outputs using any of the following (a minimal sketch of these metrics follows this feature list):

    • Exact Match: Checks whether the response matches the ground truth exactly (case-insensitive).
    • Overlap Metrics: Uses ROUGE (ROUGE-1, ROUGE-2, ROUGE-L) and BLEU scores.
    • Semantic Similarity: Computes cosine similarity between sentence embeddings via SentenceTransformers.
    • LLM Criteria: Leverages an LLM to "judge" answers against custom or default prompts.
  • Live vs. Precomputed Responses

    • Live Model: Query a model in real-time (e.g. GPT-4o or Ollama models).
    • Precomputed Responses: Upload previously generated answers for offline or batch scoring.
  • Easy Integration

    • Simple UI with Streamlit.
    • Automatic caching of embeddings and partial results.
    • Docker-ready for quick deployment.
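
As a rough illustration of what the first three metric families compute, here is a minimal sketch (not the code in src/grading.py). It assumes the rouge-score and sentence-transformers packages and the all-MiniLM-L6-v2 embedding model, and leaves out the LLM-judge path:

# minimal illustrative sketch of the metric families above (not src/grading.py)
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

prediction = "The answer is 42."
ground_truth = "42"

# exact match (case-insensitive)
exact = prediction.strip().lower() == ground_truth.strip().lower()

# overlap metrics: ROUGE-1 / ROUGE-2 / ROUGE-L F1 scores
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(ground_truth, prediction)

# semantic similarity: cosine similarity between sentence embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([prediction, ground_truth], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(exact, rouge["rougeL"].fmeasure, round(similarity, 3))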

Project Structure

.
├── Dockerfile
├── README.md
├── data
│   ├── custom               # Any custom uploaded data will be stored here
│   └── predefined           # Built-in example datasets
│       ├── gsm8k_100.parquet
│       └── mathqa_100.parquet
├── example.env              # Example environment file
├── pyproject.toml
├── src
│   ├── app.py               # Streamlit application entry point
│   ├── config.py            # Default settings and prompts
│   ├── dataset_loader.py
│   ├── grading.py           # Evaluation/Scoring logic
│   ├── utils.py

Requirements and Installation

  • Python >= 3.11
  • uv >= 0.5.26 (How to install uv)
  • (Optional) Docker for container-based deployment

Install dependencies locally (without Docker)

# clone the repository
git clone https://github.com/brotSchimmelt/llm-eval.git
cd llm-eval

# create a virtual environment (recommended)
uv venv --python 3.11

# install dependencies
uv sync

# activate the virtual environment
source .venv/bin/activate

Usage

Local Environment

  1. Set up environment variables (optional):
    For example, save your OpenAI API key in a .env file (example.env shows the expected format; a minimal example follows these steps).

  2. Run the Streamlit application:

    streamlit run src/app.py
  3. Access the app:
    Open your web browser at http://localhost:8501.
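
A minimal .env usually needs little more than an OpenAI API key. The variable name below is the conventional OPENAI_API_KEY and is an assumption here; check example.env for the exact names the app expects:

OPENAI_API_KEY=sk-your-key-here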

Docker

  1. Build the Docker image or pull it from the GitHub Container Registry:

    docker build -t llm-eval .
    # or
    docker pull ghcr.io/brotschimmelt/llm-eval:latest
  2. Run the container:

    docker run -p 8501:8501 --env-file .env llm-eval
  3. Access the app:
    Open your browser at http://localhost:8501.

Cloud GPU

You can also use Terraform to provision a cloud GPU instance and run the app there. The repository contains a main.tf file with the necessary configuration for Lambda Labs. Configure all variables, including the API key, in terraform.tfvars: add an SSH key to your Lambda Labs account and set the ssh_key_name variable to that key's name, and set the machine_name variable to the name you want to give the instance.
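
A terraform.tfvars along these lines should work. Only ssh_key_name and machine_name are named in this README; the API-key variable name below is a placeholder, so check main.tf for the actual variable names:

# illustrative terraform.tfvars; see main.tf for the real variable names
lambda_api_key = "your-lambda-labs-api-key"  # placeholder name for the API key variable
ssh_key_name   = "my-ssh-key"                # SSH key already added to your Lambda Labs account
machine_name   = "llm-eval-gpu"              # name for the provisioned instance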

Then you can run the following commands:

# initialize the Terraform working directory
terraform init

# preview the planned changes
terraform plan

# apply the changes and provision the instance
terraform apply -var-file="terraform.tfvars"

# to destroy the instance
terraform destroy -var-file="terraform.tfvars"

Access the app at http://<instance-ip>:8501.

Application Workflow

Once the app is running locally or in a container:

  1. Select a Dataset:

    • Sample Dataset: A small preloaded dataset for quickly checking that the selected model runs.
    • Predefined Dataset: Choose from built-in .parquet files in data/predefined/.
    • Upload Custom Dataset: Upload your own CSV/JSON with question and ground_truth columns.
  2. Choose a Pipeline Mode in the sidebar:

    • Live Model: Configure and query an LLM in real-time.
    • Precomputed Responses: Upload a CSV/JSON containing pre-generated answers.
  3. Select Evaluation Method:

    • Exact Match, Overlap Metrics, Semantic Similarity, LLM Criteria, or Combined.
  4. Run Evaluation: Click “Run Evaluation” and review the scores.

Configuration

The default configuration lives in src/config.py (an illustrative sketch follows the list below). Key settings include:

  • Paths for predefined and custom dataset directories.
  • Default model ("gpt-4o-mini") and sampling parameters (top_p, temperature, etc.).
  • fallback_criteria for auto-generating an LLM-based grading prompt if none is provided.
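
For orientation, the settings above could be organized roughly like this. This is an illustrative sketch, not the contents of src/config.py; apart from "gpt-4o-mini", top_p, temperature, and fallback_criteria, the names and values are assumptions:

# illustrative sketch only; src/config.py is the source of truth
from pathlib import Path

PREDEFINED_DATA_DIR = Path("data/predefined")  # built-in .parquet datasets
CUSTOM_DATA_DIR = Path("data/custom")          # uploaded CSV/JSON datasets

DEFAULT_MODEL = "gpt-4o-mini"
DEFAULT_SAMPLING = {"temperature": 0.0, "top_p": 1.0}  # assumed example values

# used to auto-generate an LLM-based grading prompt if none is provided
fallback_criteria = "Judge whether the response is equivalent to the ground-truth answer."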

Adding Your Own Datasets

  1. Custom:
    • Place a CSV or JSON file with question and ground_truth columns in data/custom/ (see the example below).
    • Or upload it via the UI.
  2. Predefined:
    • Convert your dataset to .parquet format.
    • Place it in data/predefined/.
    • Restart the app to see your dataset listed.
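
For reference, a minimal custom dataset in CSV form needs only the two required columns, for example:

question,ground_truth
"What is 7 * 6?","42"
"What is the capital of France?","Paris"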

License

This project is licensed under the MIT License.
