
vibecheck CLI

An agent evaluation framework for any LLM: a simple, intuitive YAML-based DSL for agent evals.

vibecheck makes it easy to evaluate any language model with a simple YAML configuration. Run evals, save the results, and tweak your system prompts in an incredibly tight feedback loop, all from the command line.

Get Your Invite

vibe check is currently being offered as an invite-only developer preview! Read our FAQ and request your API key at vibescheck.io.

Installation

npm install -g vibecheck-cli

Get your API key at vibescheck.io

Quick Start

Create a simple evaluation file:

# hello-world.yaml
metadata:
  name: hello-world
  model: anthropic/claude-3.5-sonnet

evals:
  - prompt: Say hello
    checks:
      - match: "*hello*"
      - min_tokens: 1
      - max_tokens: 50

Run the evaluation:

vibe check -f hello-world.yaml

Output:

hello-world  ----|+++++  βœ… in 2.3s

hello-world: Success Pct: 2/2 (100.0%)

Documentation

Essential Commands

Run Evaluations

vibe check -f hello-world.yaml                    # Run from file
vibe check my-suite                               # Run saved suite
vibe check -f my-eval.yaml -m "openai*,anthropic*" # Multi-model comparison

β†’ Full CLI Reference

Manage Suites

vibe set -f my-eval.yaml        # Save a suite
vibe get suites                 # List all suites
vibe get suite <name>          # Get specific suite

View Results

vibe get runs                           # List all runs
vibe get runs --sort-by price-performance # Compare models by score
vibe get runs --suite my-suite         # Filter by suite

Manage Variables & Secrets

vibe var set <name> <value>      # Set a variable
vibe secret set <name> <value>   # Set a secret (write-only)
vibe get vars                    # List all variables

β†’ Full CLI Reference

Featured Examples

🌍 Multilingual Testing

Test your model across 10+ languages:

metadata:
  name: multilingual-pbj
  model: meta-llama/llama-4-maverick
  system_prompt: "You are a translator. Respond both in the language the question is asked in and in English."

evals:
  - prompt: "Describe how to make a peanut butter and jelly sandwich."
    checks:
      - match: "*bread*"
      - llm_judge:
          criteria: "Does this accurately describe how to make a PB&J in English"
      - min_tokens: 20
      - max_tokens: 300
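
To exercise more of those languages, additional evals can be appended to the same suite. A sketch of what a second, Spanish-language eval might look like (the prompt, patterns, and token bound are illustrative, not part of the shipped example):

  # Hypothetical follow-up eval in the same suite, with a Spanish prompt
  - prompt: "Describe cómo hacer un sándwich de mantequilla de maní y mermelada."
    checks:
      - match: "*pan*"       # the Spanish answer should mention bread ("pan")
      - match: "*bread*"     # the system prompt also asks for an English answer
      - llm_judge:
          criteria: "Does this accurately describe how to make a PB&J in both Spanish and English?"
      - min_tokens: 20
      - max_tokens: 400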

πŸ”§ MCP Tool Integration

Test MCP tool calling with secure configuration:

# Set up secrets and variables
vibe set secret linear.apiKey "your-api-key"
vibe set var linear.projectId "your-project-id"

# Run the evaluation
vibe check linear-mcp

🧠 Advanced Patterns

Combine multiple check types:

evals:
  - prompt: How are you today?
    checks:
      - semantic:
          expected: "I'm doing well, thank you for asking"
          threshold: 0.7
      - llm_judge:
          criteria: "Is this a friendly and appropriate response?"
      - min_tokens: 10
      - max_tokens: 100

β†’ More Examples

YAML Syntax Reference

vibecheck evaluations are defined in YAML with a simple, intuitive syntax.

Quick Reference

Check Types:

  • match - Glob pattern matching
  • not_match - Negated patterns
  • or - OR logic for multiple patterns
  • min_tokens / max_tokens - Token length constraints
  • semantic - Semantic similarity using embeddings
  • llm_judge - LLM-based quality evaluation

Example:

metadata:
  name: my-eval
  model: anthropic/claude-3.5-sonnet

evals:
  - prompt: What is 2+2?
    checks:
      - or:
          - match: "*4*"
          - match: "*four*"
      - min_tokens: 1
      - max_tokens: 20
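
not_match and semantic compose the same way as the checks above. A minimal sketch (the pattern and threshold values are illustrative assumptions, not canonical defaults):

evals:
  - prompt: What is 2+2?
    checks:
      - not_match: "*5*"          # fail if the response contains a 5
      - semantic:
          expected: "The answer is 4"
          threshold: 0.7          # similarity cutoff; tune per eval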

β†’ Full YAML Syntax Reference

Model Comparison

Run evaluations on multiple models and compare results:

# Run on specific models
vibe check -f my-eval.yaml -m "openai/gpt-4,anthropic/claude-3.5-sonnet"

# Run on all OpenAI models
vibe check -f my-eval.yaml -m "openai*"

# Run on all models
vibe check -f my-eval.yaml -m all

# View results sorted by score
vibe get runs --sort-by price-performance

β†’ Model Comparison Guide

Programmatic API

Use vibecheck in your code and tests:

import { runVibeCheck } from '@vibecheck/runner';
import { extendExpect } from '@vibecheck/runner/jest';

extendExpect(expect);

describe('My LLM Feature', () => {
  it('should pass all vibe checks', async () => {
    const results = await runVibeCheck({
      file: './evals/my-feature.yaml'
    });

    expect(results).toHavePassedAll();
  });
});

β†’ Programmatic API Guide

Success Rates

Success rates are displayed as percentages with color coding:

  • Green (>80% pass rate) - High success rate
  • Yellow (50-80% pass rate) - Moderate success rate
  • Red (<50% pass rate) - Low success rate

Individual Check Results:

  • βœ… PASS - Check passed
  • ❌ FAIL - Check failed

Exit Codes:

  • 0 - Moderate or high success rate (β‰₯50% pass rate)
  • 1 - Low success rate (<50% pass rate)
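
Because vibe check exits non-zero on a low success rate, it can gate CI directly. A minimal GitHub Actions sketch (the workflow layout and the VIBECHECK_API_KEY secret name are assumptions; supply your vibescheck.io API key however your setup expects):

# .github/workflows/vibecheck.yml (hypothetical)
name: vibe-check
on: [push]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install -g vibecheck-cli
      - run: vibe check -f hello-world.yaml   # exit code 1 (<50% pass rate) fails the job
        env:
          VIBECHECK_API_KEY: ${{ secrets.VIBECHECK_API_KEY }}   # hypothetical variable name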

Contributing

We welcome contributions! See CONTRIBUTING.md for development guidelines.

Development Setup:

# Install dependencies
npm install

# Build packages
npm run build

# Run tests
npm test

# Run CLI locally
npm run start -- check -f examples/hello-world.yaml

License

Apache 2.0 - See LICENSE for details.


Wanna check the vibe? Get started at vibescheck.io πŸš€