FlexEval is a tool for designing custom metrics, completion functions, and LLM-graded rubrics for evaluating the behavior of LLM-powered systems.
Documentation: https://digitalharborfoundation.github.io/FlexEval
Additional details about FlexEval can be found in our paper at the Educational Data Mining 2024 conference.
Basic usage:
```python
import flexeval
from flexeval.schema import Eval, EvalRun, FileDataSource, Metrics, FunctionItem, Config

data_sources = [FileDataSource(path="vignettes/conversations.jsonl")]
eval = Eval(metrics=Metrics(function=[FunctionItem(name="flesch_reading_ease")]))
config = Config(clear_tables=True)
eval_run = EvalRun(
    data_sources=data_sources,
    database_path="eval_results.db",
    eval=eval,
    config=config,
)
flexeval.run(eval_run)
```
This example computes Flesch reading ease for every turn in a list of conversations provided in JSONL format. The metric values are stored in an SQLite database called `eval_results.db`.
See additional usage examples in the vignettes.
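The exact schema expected for the input file is documented in the vignettes; as a rough sketch, each line of the JSONL file holds one conversation. The record shape and field names below are illustrative assumptions in OpenAI-style chat format, not FlexEval's authoritative schema:

```python
import json

# A hypothetical conversation record in OpenAI-style chat format.
# The field names here are illustrative assumptions; see the
# vignettes for the schema FlexEval actually expects.
record = {
    "input": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}

# Each line of the JSONL file is one conversation, serialized as JSON.
line = json.dumps(record)
print(line)
```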
FlexEval is on PyPI as `python-flexeval`. See the Installation section in the Getting Started guide. Using pip:

```shell
pip install python-flexeval
```
FlexEval is designed to be "batteries included" for many basic use cases. It supports the following out-of-the-box:
- scoring historical conversations - useful for monitoring live systems
- scoring LLMs:
  - hosted locally and served via an endpoint, e.g. with LM Studio
  - reachable over the network via a REST endpoint
  - any OpenAI LLM
- a set of useful rubrics
- a set of useful Python functions
Evaluation results are saved in an SQLite database. See the Metric Analysis vignette for a sample analysis demonstrating the structure and utility of the data saved by FlexEval.
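Because results are stored in a plain SQLite file, they can be inspected with Python's standard library. A minimal sketch: the table names are part of FlexEval's schema (see the Metric Analysis vignette), so this discovers them rather than assuming them:

```python
import sqlite3

# Open the results database produced by flexeval.run
# (sqlite3 creates an empty file if it does not exist yet).
conn = sqlite3.connect("eval_results.db")

# List the tables FlexEval created; exact table names are defined by
# FlexEval's schema, so we query sqlite_master instead of hard-coding them.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
]
print(tables)

conn.close()
```

From here, any SQL client or pandas' `read_sql_query` can be used for analysis.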
Read more in the Getting Started guide.
If this work is useful to you, please cite our EDM 2024 paper:
S. Thomas Christie, Baptiste Moreau-Pernet, Yu Tian, & John Whitmer. (2024). FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis. Proceedings of the 17th International Conference on Educational Data Mining, 903-908. Atlanta, Georgia, USA, July 2024. https://doi.org/10.5281/zenodo.12729993
Pull requests are welcome. Feel free to contribute:
- New rubrics or functions
- Bug fixes
- New features
See DEVELOPMENT.md.