JFBench is a benchmark suite for evaluating Japanese LLM instruction-following performance. It provides scripts for generation, evaluation, summary, and visualization.
Dependencies are managed with uv.
```shell
uv sync
```

Some constraints use an LLM as a judge for evaluation. By default, `gpt-oss-120b` is used via OpenRouter. Set the OpenRouter API key in `OPENROUTER_API_KEY`:
```shell
export OPENROUTER_API_KEY="your_openrouter_api_key"
```

The scripts below live under `src/jfbench`.
Example (evaluate a model on OpenRouter):
```shell
uv run python src/jfbench/benchmark/eval.py \
  --benchmark "ifbench" \
  --output-dir data/benchmark_results \
  --n-constraints "1,2,4,8" \
  --constraint-set "test" \
  --n-benchmark-data 200 \
  --model-specs-json '[{"provider": "openrouter", "model": "qwen/qwen3-30b-a3b-thinking-2507", "model_short": "Qwen3 30B A3B Thinking 2507"}]'
```

Example (evaluate a local vLLM server): options such as the server URL are passed via `extra_body`. See `src/jfbench/llm.py` for details.
```shell
uv run python src/jfbench/benchmark/eval.py \
  --benchmark "ifbench" \
  --output-dir data/benchmark_results \
  --n-constraints "1,2,4,8" \
  --constraint-set "test" \
  --n-benchmark-data 200 \
  --model-specs-json '[{"provider": "vllm", "model": "/path/to/model_to_evaluate", "model_short": "Model to evaluate", "extra_body": {"base_url": "http://localhost:8001/v1"}}]' \
  --judge-model-spec-json '{"provider": "vllm", "model": "/path/to/judge_model", "model_short": "Local vLLM Judge", "extra_body": {"base_url": "http://localhost:8000/v1"}}'
```

Options:

- `--benchmark`: Benchmark name. Currently only `ifbench` is supported. Default `ifbench`.
- `--ifbench-dataset-path`: Path to an external IFBench JSONL file. Default `None`, which uses the bundled dataset under `data/`.
- `--output-dir`: Directory for result JSONL files. Default `data/benchmark_results`.
- `--with-generate` / `--no-with-generate`: Enable or disable generation. Default enabled.
- `--with-eval` / `--no-with-eval`: Enable or disable evaluation. Default enabled.
- `--override`: Re-run even if matching entries already exist. Default disabled.
- `--n-constraints`: Number of constraints. Comma-separated values supported. Default `1`.
- `--constraint-set`: Constraint set (`training`/`test`). Default `test`.
- `--n-benchmark-data`: Number of entries to use. If omitted, all entries are used when `n_constraints` is 1. Required when `n_constraints` is 2 or higher.
- `--seed`: Random seed. Default `42`.
- `--model-specs-json` (required): JSON string that lists the evaluated models.
- `--judge-model-spec-json`: JSON string for the judge model. By default, OpenRouter `gpt-oss-120b` is used with reasoning effort `medium`.
- `--n-concurrent-generations`: Concurrent generation requests. Use `-1` to send all at once. Default `-1`.
Example:
```shell
uv run python src/jfbench/benchmark/analyze.py \
  --results-path data/benchmark_results
```

Options:

- `--results-path`: JSONL file or directory to analyze. Default `data/benchmark_results.jsonl`.
- `--constraint`: Filter to records that include the named constraint.
- `--show-generated`: Show generated responses after the summary table.
Example:
```shell
uv run python src/jfbench/visualization/visualize.py \
  --input-dir data/benchmark_results \
  --output-dir visualization_output \
  --n-constraints 1 \
  --prompt-source ifbench
```

Options:

- `--input-dir`: Directory with result JSONL files. Default `data/benchmark_results`.
- `--output-dir`: Output directory for charts. Default `visualization_output`.
- `--drop-incomplete`: Drop rows without completed evaluations. Default disabled.
- `--n-constraints` (required): Constraint counts to include. Can be repeated or comma-separated.
- `--prompt-source`: Prompt sources to include. Only `ifbench` is supported. Can be repeated or comma-separated.
- `--models`: Filter to specific model names. Can be repeated.
- `--constraint-set`: Constraint set filters (`training`/`test`). Can be repeated or comma-separated. By default both are included.
- `--model-label-map`: JSON string mapping model labels.