An agent evaluation framework for any LLM - a simple and intuitive YAML-based DSL for agent evals.
vibecheck makes it easy to evaluate any language model with a simple YAML configuration. Run evals, save the results, and tweak your system prompts in an incredibly tight feedback loop, all from the command line.
Get Your Invite
vibe check is currently being offered as an invite-only developer preview! Read our FAQ and request your API key at vibescheck.io.
```bash
npm install -g vibecheck-cli
```

Get your API key at vibescheck.io.
Create a simple evaluation file:
```yaml
# hello-world.yaml
metadata:
  name: hello-world
  model: anthropic/claude-3.5-sonnet

evals:
  - prompt: Say hello
    checks:
      - match: "*hello*"
      - min_tokens: 1
      - max_tokens: 50
```

Run the evaluation:
```bash
vibe check -f hello-world.yaml
```

Output:

```
hello-world ----|+++++ ✓
in 2.3s
hello-world: Success Pct: 2/2 (100.0%)
```
- YAML Syntax Reference - Complete guide to evaluation syntax and check types
- CLI Reference - All CLI commands, options, and flags
- Examples - Featured examples and best practices
- Model Comparison & Scoring - Compare models and understand scoring
- Programmatic API - Use vibecheck in your code and tests
- Using with Claude Code - Skills and agent for Claude Code integration
```bash
vibe check -f hello-world.yaml                       # Run from file
vibe check my-suite                                  # Run saved suite
vibe check -f my-eval.yaml -m "openai*,anthropic*"   # Multi-model comparison

vibe set -f my-eval.yaml                             # Save a suite
vibe get suites                                      # List all suites
vibe get suite <name>                                # Get specific suite

vibe get runs                                        # List all runs
vibe get runs --sort-by price-performance            # Compare models by score
vibe get runs --suite my-suite                       # Filter by suite

vibe var set <name> <value>                          # Set a variable
vibe secret set <name> <value>                       # Set a secret (write-only)
vibe get vars                                        # List all variables
```

Test your model across 10+ languages:
```yaml
metadata:
  name: multilingual-pbj
  model: meta-llama/llama-4-maverick
  system_prompt: "You are a translator. Respond both in the language the question is asked in and in English."

evals:
  - prompt: "Describe how to make a peanut butter and jelly sandwich."
    checks:
      - match: "*bread*"
      - llm_judge:
          criteria: "Does this accurately describe how to make a PB&J in English?"
      - min_tokens: 20
      - max_tokens: 300
```
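Additional languages slot in as further `evals` entries using the same check types; a minimal sketch (the French prompt and expected keywords are illustrative, not part of the suite above):

```yaml
evals:
  - prompt: "Décrivez comment préparer un sandwich au beurre de cacahuète et à la confiture."
    checks:
      # The response should still mention bread in one language or the other
      # ("pain" is the illustrative French keyword).
      - or:
          - match: "*pain*"
          - match: "*bread*"
      - llm_judge:
          criteria: "Does the response describe making a PB&J in both French and English?"
```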
Test MCP tool calling with secure configuration:

```bash
# Set up secrets and variables
vibe secret set linear.apiKey "your-api-key"
vibe var set linear.projectId "your-project-id"

# Run the evaluation
vibe check linear-mcp
```
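The `linear-mcp` suite itself is just a saved eval file. Purely as a hypothetical sketch of how the stored secret and variable might be wired in (the `mcp` block and the `{{secret:...}}`/`{{var:...}}` interpolation syntax are assumptions, not confirmed vibecheck API; see the YAML Syntax Reference for the real schema):

```yaml
# linear-mcp.yaml (hypothetical sketch; field names are assumptions)
metadata:
  name: linear-mcp
  model: anthropic/claude-3.5-sonnet
  mcp:
    server: linear
    apiKey: "{{secret:linear.apiKey}}"   # assumed interpolation syntax

evals:
  - prompt: "List the open issues in project {{var:linear.projectId}}."
    checks:
      - llm_judge:
          criteria: "Did the response use the Linear tool results rather than inventing issues?"
```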
Combine multiple check types:

```yaml
evals:
  - prompt: How are you today?
    checks:
      - semantic:
          expected: "I'm doing well, thank you for asking"
          threshold: 0.7
      - llm_judge:
          criteria: "Is this a friendly and appropriate response?"
      - min_tokens: 10
      - max_tokens: 100
```

vibecheck evaluations are defined in YAML with a simple, intuitive syntax.
Check Types:
- `match` - Glob pattern matching
- `not_match` - Negated patterns
- `or` - OR logic for multiple patterns
- `min_tokens` / `max_tokens` - Token length constraints
- `semantic` - Semantic similarity using embeddings
- `llm_judge` - LLM-based quality evaluation
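Of these, `not_match` is the only check not demonstrated elsewhere on this page; a minimal sketch (the prompt and patterns are illustrative):

```yaml
evals:
  - prompt: "Reply with a single short greeting."
    checks:
      - match: "*hello*"
      # Fail if the response hedges with an apology (illustrative pattern)
      - not_match: "*sorry*"
```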
Example:

```yaml
metadata:
  name: my-eval
  model: anthropic/claude-3.5-sonnet

evals:
  - prompt: What is 2+2?
    checks:
      - or:
          - match: "*4*"
          - match: "*four*"
      - min_tokens: 1
      - max_tokens: 20
```

→ Full YAML Syntax Reference
Run evaluations on multiple models and compare results:
```bash
# Run on specific models
vibe check -f my-eval.yaml -m "openai/gpt-4,anthropic/claude-3.5-sonnet"

# Run on all OpenAI models
vibe check -f my-eval.yaml -m "openai*"

# Run on all models
vibe check -f my-eval.yaml -m all

# View results sorted by score
vibe get runs --sort-by price-performance
```

Use vibecheck in your code and tests:
```typescript
import { runVibeCheck } from '@vibecheck/runner';
import { extendExpect } from '@vibecheck/runner/jest';

// Register the vibecheck matchers (e.g., toHavePassedAll) on Jest's expect
extendExpect(expect);

describe('My LLM Feature', () => {
  it('should pass all vibe checks', async () => {
    const results = await runVibeCheck({
      file: './evals/my-feature.yaml'
    });
    expect(results).toHavePassedAll();
  });
});
```

Success rates are displayed as percentages with color coding:
- Green (>80% pass rate) - High success rate
- Yellow (50-80% pass rate) - Moderate success rate
- Red (<50% pass rate) - Low success rate
Individual Check Results:
- ✓ PASS - Check passed
- ✗ FAIL - Check failed
Exit Codes:
- `0` - Moderate or high success rate (≥50% pass rate)
- `1` - Low success rate (<50% pass rate)
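Because the exit code encodes the outcome, `vibe check` can gate CI directly; a minimal sketch of a GitHub Actions job (the workflow scaffolding and the `VIBECHECK_API_KEY` variable name are illustrative assumptions, not part of vibecheck):

```yaml
# .github/workflows/evals.yml (illustrative)
name: vibe-checks
on: [push]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g vibecheck-cli
      # A <50% pass rate exits 1 and fails the job (see exit codes above).
      # VIBECHECK_API_KEY is an assumed name for passing the API key.
      - run: vibe check -f hello-world.yaml
        env:
          VIBECHECK_API_KEY: ${{ secrets.VIBECHECK_API_KEY }}
```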
We welcome contributions! See CONTRIBUTING.md for development guidelines.
Development Setup:
```bash
# Install dependencies
npm install

# Build packages
npm run build

# Run tests
npm test

# Run CLI locally
npm run start -- check -f examples/hello-world.yaml
```

Apache 2.0 - See LICENSE for details.
Wanna check the vibe? Get started at vibescheck.io