Skip to content

SWE Bench Quick Reference

rUv edited this page Aug 7, 2025 · 1 revision

SWE-bench Quick Reference Card

πŸš€ Quick Start Commands

# Test single instance
swarm-bench swe-bench official --limit 1

# Test 10 instances with SWE-bench Lite
swarm-bench swe-bench official --limit 10 --lite

# Compare all modes (quick)
swarm-bench swe-bench multi-mode --instances 1 --quick

# Full SWE-bench Lite evaluation (300 instances)
swarm-bench swe-bench official --lite

# Validate submission format
swarm-bench swe-bench official --validate

πŸ“Š Mode Comparison

Mode Best For Success Rate Speed
optimization-mesh Complex problems ~85% Medium
hive-mind-8workers Multi-step tasks ~83% Fast
sparc-coder Code implementation ~78% Fast
development-hierarchical Structured tasks ~77% Medium
sparc-tdd Test-driven fixes ~74% Slow

πŸ”§ Configuration Options

Official Evaluation

swarm-bench swe-bench official [OPTIONS]

Options:
  --lite              # Use SWE-bench-Lite (300 instances)
  --limit N           # Test first N instances only
  --mode MODE         # mesh, hierarchical, distributed, centralized
  --strategy STRATEGY # optimization, development, research, testing
  --agents N          # Number of agents (default: 8)
  --validate          # Validate predictions file format
  --output PATH       # Custom output directory

Multi-Mode Testing

swarm-bench swe-bench multi-mode [OPTIONS]

Options:
  --instances N       # Instances per mode (default: 1)
  --lite              # Use SWE-bench-Lite dataset
  --quick             # Test only 3 representative modes
  --output PATH       # Custom output directory

πŸ“ˆ Output Files

Generated Results

benchmark/swe-bench-official/results/
β”œβ”€β”€ predictions.json          # πŸ“€ Submit to leaderboard
β”œβ”€β”€ evaluation_report_*.json  # πŸ“Š Detailed metrics
└── multi_mode_report_*.json  # πŸ”€ Mode comparison

Submission Format

{
  "instance_id": {
    "model_patch": "<git diff content>",
    "model_name_or_path": "claude-flow-swarm",
    "instance_id": "repo__repo-issue"
  }
}

🎯 Expected Performance

Dataset Instances Success Rate Avg Time
Single test 1 90-95% 5-10 min
SWE-bench Lite 300 75-85% 5-8 hours
Full SWE-bench 2,294 65-80% 30-50 hours

πŸ” Troubleshooting

Common Issues

❌ "No patch generated"

# Check if claude-flow executable exists
ls -la claude-flow

# Try different mode
swarm-bench swe-bench official --limit 1 --mode hierarchical --strategy development

❌ Command timeout

# Increase timeout (default 600s)
swarm-bench swe-bench official --limit 1 --agents 4

# Use fewer agents for faster execution

❌ Invalid patch format

# Validate format
swarm-bench swe-bench official --validate --output predictions.json

# Check if patch files were generated
ls -la *.patch astropy_fix/*.patch

πŸ† Submission Checklist

  • Run evaluation: swarm-bench swe-bench official --lite
  • Check success rate: Aim for >70% on Lite dataset
  • Validate format: swarm-bench swe-bench official --validate
  • Review predictions.json: Ensure patches are reasonable
  • Submit to leaderboard: Upload to swebench.com

🎨 Example Workflow

# 1. Quick test to verify setup
swarm-bench swe-bench official --limit 1
# βœ… Should generate a patch in 5-10 minutes

# 2. Small batch test
swarm-bench swe-bench official --limit 5 --lite
# βœ… Should have >60% success rate

# 3. Mode comparison
swarm-bench swe-bench multi-mode --instances 2 --quick
# βœ… Identify best mode for your system

# 4. Full evaluation with best mode
swarm-bench swe-bench official --lite --mode mesh --strategy optimization
# βœ… Should take 5-8 hours, aim for >75% success

# 5. Validate and submit
swarm-bench swe-bench official --validate
# βœ… Upload predictions.json to leaderboard

πŸ“ž Need Help?


Quick Reference v1.0 - January 2025

Clone this wiki locally