Specification testing for structured LLM outputs.
Litmus lets you define test cases with input strings and expected JSON outputs, run them against LLMs via OpenRouter, and compare accuracy, latency, and throughput across models.
```
$ litmus run --tests example/tests.json --schema example/schema.json --prompt-file example/prompt.txt --model openai/gpt-4.1-nano --model mistralai/mistral-nemo

Running 2 tests against openai/gpt-4.1-nano...
Running 2 tests against mistralai/mistral-nemo...

Litmus Test Report
──────────────────────────────────────────────────
Timestamp: 2025-12-27T16:19:30Z
Test File: example/tests.json
Schema: example/schema.json
Model: openai/gpt-4.1-nano
──────────────────────────────────────────────────
Provider: OpenAI
Results: 2 passed / 0 failed (100.0% accuracy)
Tokens: 148 in / 34 out
Latency: P50=363ms P95=454ms P99=462ms
Duration: 2.11s (16.1 tok/s)

┌────────────────────────┬────────┬─────────┬────────┐
│ TEST                   │ STATUS │ LATENCY │ TOKENS │
├────────────────────────┼────────┼─────────┼────────┤
│ Extract person info    │ ✓ PASS │ 263ms   │ 74/17  │
│ Extract another person │ ✓ PASS │ 464ms   │ 74/17  │
└────────────────────────┴────────┴─────────┴────────┘

Model: mistralai/mistral-nemo
──────────────────────────────────────────────────
Provider: Mistral
Results: 2 passed / 0 failed (100.0% accuracy)
Tokens: 64 in / 56 out
Latency: P50=254ms P95=262ms P99=263ms
Duration: 763ms (73.4 tok/s)

┌────────────────────────┬────────┬─────────┬────────┐
│ TEST                   │ STATUS │ LATENCY │ TOKENS │
├────────────────────────┼────────┼─────────┼────────┤
│ Extract person info    │ ✓ PASS │ 246ms   │ 32/28  │
│ Extract another person │ ✓ PASS │ 263ms   │ 32/28  │
└────────────────────────┴────────┴─────────┴────────┘

Model Comparison
──────────────────────────────────────────────────
┌────────────────────────┬──────────┬──────────┬──────────────┬─────────┬────────┐
│ MODEL                  │ PROVIDER │ ACCURACY │ P 50 LATENCY │ TOK / S │ TOKENS │
├────────────────────────┼──────────┼──────────┼──────────────┼─────────┼────────┤
│ openai/gpt-4.1-nano    │ OpenAI   │ 100.0%   │ 363ms        │ 16.1    │ 182    │
│ mistralai/mistral-nemo │ Mistral  │ 100.0%   │ 254ms        │ 73.4    │ 120    │
└────────────────────────┴──────────┴──────────┴──────────────┴─────────┴────────┘
```
Download a pre-built binary from the latest release, or install with Go:

```sh
go install go.carr.sh/litmus@latest
```

Or compile from source:

```sh
git clone https://github.com/lukecarr/litmus.git
cd litmus
go build -o litmus .
```

Quick start:

- Set your OpenRouter API key:

```sh
export OPENROUTER_API_KEY="your-api-key"
```

- Create a test file (`tests.json`):

```json
[
{
"name": "Extract person info",
"input": "John Smith is 30 years old and works at Acme Corp",
"expected": {
"name": "John Smith",
"age": 30,
"company": "Acme Corp"
}
},
{
"name": "Extract another person",
"input": "Jane Doe, age 25, is employed by TechStart Inc",
"expected": {
"name": "Jane Doe",
"age": 25,
"company": "TechStart Inc"
}
}
]
```

- Create a JSON schema (`schema.json`):

```json
{
"type": "object",
"properties": {
"name": { "type": "string" },
"age": { "type": "integer" },
"company": { "type": "string" }
},
"required": ["name", "age", "company"],
"additionalProperties": false
}
```

- Create a prompt file (`prompt.txt`):

```
Extract the person's name, age, and company from the given text.
```

- Run tests:

```sh
litmus run --tests tests.json --schema schema.json --prompt-file prompt.txt --model openai/gpt-4.1-nano
```

General usage:

```sh
litmus run --tests <test-file> --schema <schema-file> --prompt <prompt> --model <model>
```

| Flag | Short | Description |
|---|---|---|
| `--tests` | `-t` | Path to test cases JSON file (required) |
| `--schema` | `-s` | Path to JSON schema file (required) |
| `--prompt` | `-p` | System prompt for the LLM |
| `--prompt-file` | | Path to file containing system prompt |
| `--model` | `-m` | Model to test against (required, can be repeated) |
| `--parallel` | `-P` | Number of parallel requests per model (default: 1) |
| `--json` | | Output results as JSON |
| `--api-key` | | OpenRouter API key (or use `OPENROUTER_API_KEY` env var) |
Single model:

```sh
litmus run \
--tests tests.json \
--schema schema.json \
--prompt-file prompt.txt \
  --model openai/gpt-4.1-nano
```

Multiple models for comparison:

```sh
litmus run \
--tests tests.json \
--schema schema.json \
--prompt "Extract entities from the text" \
--model openai/gpt-4.1-nano \
  --model mistralai/mistral-nemo
```

Parallel execution:

```sh
litmus run \
--tests tests.json \
--schema schema.json \
--prompt-file prompt.txt \
--model openai/gpt-4.1-nano \
  --parallel 5
```

JSON output for CI/CD:

```sh
litmus run \
--tests tests.json \
--schema schema.json \
--prompt-file prompt.txt \
--model openai/gpt-4.1-nano \
  --json > results.json
```

The test file is a JSON array of test cases:

```json
[
{
"name": "Test name (for display)",
"input": "The input text to send to the LLM",
"expected": {
"field1": "expected value",
"field2": 123
}
}
]
```

Each test case has three fields:

- `name`: A human-readable name for the test case
- `input`: The user message sent to the LLM
- `expected`: The expected JSON output (must match the schema)
The schema file should be a valid JSON Schema. It is passed to OpenRouter's `response_format` parameter to enforce structured output from the LLM.
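For context, OpenRouter's structured-output requests wrap the schema in a `json_schema` envelope, so a request built from the quick-start files above would presumably look something like the sketch below. The envelope's `name` and `strict` values are illustrative assumptions, not taken from Litmus's source:

```json
{
  "model": "openai/gpt-4.1-nano",
  "messages": [
    { "role": "system", "content": "Extract the person's name, age, and company from the given text." },
    { "role": "user", "content": "John Smith is 30 years old and works at Acme Corp" }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "extraction",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "age": { "type": "integer" },
          "company": { "type": "string" }
        },
        "required": ["name", "age", "company"],
        "additionalProperties": false
      }
    }
  }
}
```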
Example schema:

```json
{
"type": "object",
"properties": {
"sentiment": {
"type": "string",
"enum": ["positive", "negative", "neutral"]
},
"confidence": {
"type": "number",
"minimum": 0,
"maximum": 1
}
},
"required": ["sentiment", "confidence"],
"additionalProperties": false
}
```
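A test case targeting this schema pairs an input sentence with the exact structured output you expect back; the example below is illustrative and not part of the repository. Since `expected` is compared against the model's actual output, asserting a precise free-form value like `confidence` can be brittle across models:

```json
[
  {
    "name": "Classify a positive review",
    "input": "I absolutely love this product!",
    "expected": {
      "sentiment": "positive",
      "confidence": 0.95
    }
  }
]
```

The terminal output includes: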
- Provider used for each model
- Summary metrics (pass/fail counts, accuracy %)
- Token usage and throughput (tokens/second)
- Latency percentiles (P50, P95, P99)
- Detailed test results table
- Field-level diff for failures
- Model comparison table (when testing multiple models)
Use `--json` to get machine-readable output:

```json
{
"timestamp": "2025-12-27T16:19:30Z",
"prompt": "Extract entities...",
"schema_file": "schema.json",
"test_file": "tests.json",
"models": [
{
"model": "openai/gpt-4.1-nano",
"results": [...],
"metrics": {
"total_tests": 10,
"passed": 9,
"failed": 1,
"accuracy": 90.0,
"latency_p50_ms": 450,
"throughput_tps": 25.5
}
}
]
}
```

Exit codes:

- `0`: All tests passed
- `1`: One or more tests failed or errored
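This makes `litmus run` usable as a CI gate on its own. For a softer or per-model threshold instead of all-or-nothing, the JSON output can be checked with a tool like `jq`, assuming Litmus still emits the report when tests fail; the 90% accuracy floor below is an illustrative choice:

```sh
# Run the suite; tolerate litmus's own nonzero exit so we can
# apply a custom threshold instead of all-or-nothing.
litmus run --tests tests.json --schema schema.json --prompt-file prompt.txt \
  --model openai/gpt-4.1-nano --json > results.json || true

# Fail the build unless every model reached at least 90% accuracy.
jq -e 'all(.models[]; .metrics.accuracy >= 90)' results.json
```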
Litmus works with any model available on OpenRouter.
Litmus is licensed under the MIT License.