joinalahmed/skilleval

SkillEval

100% deterministic evaluation framework for AI agent skills

Open-source, vendor-neutral framework for evaluating AI agent skills with dual-phase scoring, comprehensive security scanning, and transparent grading.


Features

  • ✅ 100% Deterministic - No LLM judges, reproducible results
  • 🛡️ 82+ Security Patterns - OWASP Web/API/LLM/Agentic coverage
  • 📊 Dual-Phase Scoring - Separate packaging quality from runtime effectiveness
  • ⚡ Fast - Phase 1 evaluation in <1 second
  • 🎯 Confidence Weighting - Smart false positive reduction
  • 🐳 Container Isolation - Safe execution with Podman
  • 📈 Baseline Comparison - WITH-SKILL vs WITHOUT-SKILL differential scoring
  • 🎓 A-F Grading - Clear publish decisions with auto-reject

Quick Start

Installation

pip install -r requirements.txt
pip install -e .

Evaluate a Skill

# Fast packaging & security check (<1s)
python3 evaluate_skill.py /path/to/skill --phase1-only

# Full evaluation with runtime testing
python3 evaluate_skill.py /path/to/skill --full

# JSON output for CI/CD
python3 evaluate_skill.py /path/to/skill --phase1-only --format json

Example Output

Total Score: 91.0/100
Grade: A
Publish Decision: APPROVE

PILLAR 1: STATIC TESTS (50 points)
Score: 49.0/50 (Grade A)
  ✅ Frontmatter valid
  ✅ Description quality high
  ⚠️  Only 1 eval case (need 3+)

PILLAR 2: SECURITY (50 points)
Score: 42.0/50 (Grade B)
  ✅ No CRITICAL findings
  ⚠️  3 MEDIUM findings (confidence-weighted)

✅ APPROVED - Ready to publish
   Eligible for featured listing

What's Evaluated

Phase 1: Packaging & Security (0-100 points)

Static Tests (50 points)

  • ✅ Frontmatter validity - YAML, name, description, version
  • ✅ Description quality - Length, vocabulary, trigger language
  • ✅ File completeness - SKILL.md, artifacts, structure
  • ✅ Script quality - Python/shell syntax validation
  • ✅ Eval suite - Test case coverage and diversity
  • ✅ Instruction clarity - Code examples, documentation
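
As an illustration of the frontmatter validity check, the following is a minimal sketch rather than SkillEval's actual implementation. It assumes SKILL.md opens with a `---` fenced block of flat `key: value` pairs; the function names and the stdlib-only parser are hypothetical, and a real check would more likely use a YAML parser.

```python
import re

# Fields named in the "Frontmatter validity" bullet above.
REQUIRED_KEYS = ("name", "description", "version")


def parse_frontmatter(text: str) -> dict:
    """Parse flat `key: value` pairs between leading `---` fences.

    Simplification: nested YAML and multi-line values are not handled.
    """
    match = re.match(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", text, re.DOTALL)
    if match is None:
        return {}
    fields = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip().strip("\"'")
    return fields


def missing_frontmatter_keys(text: str) -> list:
    """Return the required keys that are absent or empty."""
    fields = parse_frontmatter(text)
    return [key for key in REQUIRED_KEYS if not fields.get(key)]


skill_md = """---
name: jira-comment-poster
description: Posts a comment to a Jira issue when asked to update a ticket.
version: 1.0.0
---
# Usage
"""
print(missing_frontmatter_keys(skill_md))  # []
```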

Security (50 points)

  • 🔒 18 built-in checks (Layer 1) - Zero dependencies
    • Hardcoded secrets (API keys, tokens, passwords)
    • Code injection (SQL, XSS, command injection)
    • Data exfiltration patterns
    • Prompt injection detection
    • OWASP ASI compliance
  • 🔒 64 advanced checks (Layer 2, optional) - SkillSpector integration
    • AST behavioral analysis
    • CVE database lookup
    • YARA malware signatures

Confidence Weighting:

  • ≥ 0.7: Full penalty, CRITICAL = auto-reject
  • 0.5-0.69: Full penalty, normal scoring
  • 0.3-0.49: Advisory only (shown, zero score impact)
  • < 0.3: Hidden
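
The confidence bands above can be read as a small decision function. This is a sketch under assumptions: the function name and the returned dictionary shape are hypothetical, and values between 0.69 and 0.7 are treated as part of the 0.5-0.69 band.

```python
def classify_finding(severity: str, confidence: float) -> dict:
    """Map a finding's confidence to how it is treated in the report."""
    if confidence < 0.3:
        # Hidden: not shown and not scored.
        return {"visible": False, "scored": False, "auto_reject": False}
    if confidence < 0.5:
        # Advisory only: shown, zero score impact.
        return {"visible": True, "scored": False, "auto_reject": False}
    # 0.5 and above: full penalty. Only a CRITICAL finding at
    # confidence >= 0.7 triggers auto-reject.
    auto_reject = confidence >= 0.7 and severity == "CRITICAL"
    return {"visible": True, "scored": True, "auto_reject": auto_reject}


print(classify_finding("CRITICAL", 0.9)["auto_reject"])  # True
print(classify_finding("MEDIUM", 0.4)["scored"])         # False
```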

Phase 2: Runtime Effectiveness (0-100 points)

Functional Correctness (50 points)

  • Baseline vs skill comparison (WITH-SKILL - WITHOUT-SKILL)
  • 6 deterministic graders:
    • file_exists - File creation
    • content_match - Regex patterns
    • json_schema - JSON validation
    • command_output - Command results
    • exit_code - Script success
    • line_count - File size verification
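
To make the grader idea concrete, here is a rough sketch of three of the graders and a WITH-SKILL minus WITHOUT-SKILL differential. The signatures, the pass-rate formula, and the sample file are assumptions for illustration; the project's real grader interface may differ.

```python
import re
import tempfile
from pathlib import Path


def file_exists(workdir: Path, relative_path: str) -> bool:
    """Pass if the expected file was created."""
    return (workdir / relative_path).is_file()


def content_match(workdir: Path, relative_path: str, pattern: str) -> bool:
    """Pass if the file exists and its text matches the regex."""
    target = workdir / relative_path
    if not target.is_file():
        return False
    return re.search(pattern, target.read_text(encoding="utf-8")) is not None


def line_count(workdir: Path, relative_path: str, minimum: int, maximum: int) -> bool:
    """Pass if the file's line count falls within [minimum, maximum]."""
    target = workdir / relative_path
    if not target.is_file():
        return False
    count = len(target.read_text(encoding="utf-8").splitlines())
    return minimum <= count <= maximum


def differential(with_skill: list, without_skill: list) -> float:
    """Pass rate with the skill minus pass rate without it (assumed formula)."""
    def rate(results: list) -> float:
        return sum(results) / len(results) if results else 0.0
    return rate(with_skill) - rate(without_skill)


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "report.md").write_text("# Summary\nStatus: OK\n", encoding="utf-8")
    with_skill = [
        file_exists(workdir, "report.md"),
        content_match(workdir, "report.md", r"Status:\s+OK"),
        line_count(workdir, "report.md", 1, 10),
    ]

without_skill = [False, False, False]  # hypothetical baseline run that produced nothing
print(differential(with_skill, without_skill))  # 1.0
```

Each grader returns a plain boolean from file-system state, which is what keeps this kind of check reproducible across runs.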

LLM Safety (50 points)

  • Unbounded planning detection (>15 turns)
  • Infinite loop detection (high tool/turn ratio)
  • Context rot detection (>200K tokens)
  • Hallucination detection (claims without evidence)
  • Cost tracking (>$0.10 threshold)
  • Efficiency analysis (<20% tool usage)
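
The numeric thresholds in this list lend themselves to simple comparisons over a run summary. The sketch below is illustrative only: `TraceSummary`, its fields, and the flag names are hypothetical, "tool usage" is assumed to mean tool calls divided by turns, and the loop and hallucination checks are omitted because the README gives no threshold for them.

```python
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical summary of one agent run."""
    turns: int
    tool_calls: int
    total_tokens: int
    cost_usd: float


def safety_flags(trace: TraceSummary) -> list:
    """Return the names of the thresholds the run exceeded."""
    flags = []
    if trace.turns > 15:
        flags.append("unbounded_planning")
    if trace.total_tokens > 200_000:
        flags.append("context_rot")
    if trace.cost_usd > 0.10:
        flags.append("cost_over_threshold")
    # Assumption: tool usage = tool calls per turn.
    if trace.turns and trace.tool_calls / trace.turns < 0.20:
        flags.append("low_tool_usage")
    return flags


# Turn and tool-call counts here are made up for the example.
flags = safety_flags(TraceSummary(turns=40, tool_calls=30, total_tokens=1_770_000, cost_usd=0.29))
print(flags)  # ['unbounded_planning', 'context_rot', 'cost_over_threshold']
```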

Grade Scale

Grade  Score   Decision                     Meaning
A      90-100  APPROVE (featured eligible)  Excellent
B      80-89   APPROVE                      Good
C      70-79   CONDITIONAL                  Acceptable with advisory
D      60-69   REQUIRE_ACK                  Needs improvement
F      0-59    BLOCK                        Not ready

Auto-Reject Conditions:

  • CRITICAL security finding (confidence ≥ 0.7)
  • Security score < 25/50 (50% floor)
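
The grade scale and auto-reject conditions can be combined as in the sketch below. The function names and the `AUTO_REJECT` decision label are assumptions; the README does not say what decision string an auto-rejected skill receives.

```python
# (minimum score, grade, decision), taken from the grade scale above.
BANDS = [
    (90, "A", "APPROVE"),
    (80, "B", "APPROVE"),
    (70, "C", "CONDITIONAL"),
    (60, "D", "REQUIRE_ACK"),
    (0, "F", "BLOCK"),
]


def grade_for(score: float) -> tuple:
    """Return (grade, decision) for a 0-100 score."""
    for floor, grade, decision in BANDS:
        if score >= floor:
            return grade, decision
    return "F", "BLOCK"  # defensive fallback for negative input


def publish_decision(total: float, security_score: float, has_confident_critical: bool) -> tuple:
    """Apply the auto-reject conditions before normal grading.

    `has_confident_critical` means a CRITICAL finding with confidence >= 0.7.
    """
    if has_confident_critical or security_score < 25:
        return "F", "AUTO_REJECT"  # label is an assumption
    return grade_for(total)


print(publish_decision(91.0, 42.0, False))  # ('A', 'APPROVE')
print(publish_decision(95.0, 20.0, False))  # ('F', 'AUTO_REJECT')
```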

OWASP Coverage

✅ OWASP Top 10 Web (2021) - A01, A03, A06
✅ OWASP API Security (2023) - API1-API10
✅ OWASP LLM Top 10 (2023) - LLM01-LLM09
✅ OWASP Agentic AI (2026) - ASI01-ASI10

Total: 82 unique security patterns (18 built-in + 64 optional SkillSpector checks)


Architecture

Dual-Phase Design

┌─────────────────────────────────────┐
│  Phase 1: Packaging & Security      │
│  ─────────────────────────────      │
│  • Static Tests (50 pts)            │
│  • Security Scan (50 pts)           │
│  • <1 second                        │
│  • No LLM calls                     │
└─────────────────────────────────────┘
              ↓
┌─────────────────────────────────────┐
│  Phase 2: Runtime Effectiveness     │
│  ──────────────────────────────     │
│  • Functional (50 pts)              │
│  • LLM Safety (50 pts)              │
│  • 30-600 seconds                   │
│  • Container isolation              │
└─────────────────────────────────────┘
              ↓
        Dual-Score Report
   Phase 1: 91/100 (A)
   Phase 2: 63/100 (D)
   Overall: 77/100 (C)
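
The README does not state how the overall score is derived from the two phases. The figures in the report above are consistent with an unweighted mean, which the following sketch assumes; `overall_score` is a hypothetical name.

```python
def overall_score(phase1: float, phase2: float) -> float:
    """Combine phase scores as an unweighted mean (assumed formula)."""
    return (phase1 + phase2) / 2


print(overall_score(91, 63))  # 77.0
```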

Directory Structure

skilleval/
├── src/skilleval/
│   ├── models_phase1.py         # Phase 1 data models
│   ├── models_phase2.py         # Phase 2 data models
│   ├── scorers/
│   │   ├── static_scorer.py     # ST-1 through ST-8
│   │   ├── security_scorer.py   # Layer 1 + Layer 2
│   │   ├── harness_scorer.py    # Functional + Safety
│   │   └── phase1_orchestrator.py
│   ├── pillars/
│   │   ├── static_tests.py
│   │   ├── security.py
│   │   ├── owasp_llm.py
│   │   └── harness.py
│   └── utils/
│       ├── container_executor.py
│       ├── trace_analytics.py
│       └── cve_scanner.py
│
├── tests/                       # Test suite (85% coverage)
├── examples/                    # Example skills
├── docs/                        # Documentation
└── evaluate_skill.py            # Main CLI

CLI Usage

Basic Commands

# Phase 1 only (fast, <1s)
python3 evaluate_skill.py /path/to/skill --phase1-only

# Full evaluation
python3 evaluate_skill.py /path/to/skill --full

# JSON output
python3 evaluate_skill.py /path/to/skill --phase1-only --format json

# Save to file
python3 evaluate_skill.py /path/to/skill --output report.json

Batch Evaluation

# Create the output directory first so --output has somewhere to write
mkdir -p reports
for skill in /path/to/skills/*; do
    python3 evaluate_skill.py "$skill" --phase1-only \
        --output "reports/$(basename "$skill").json"
done
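
The same loop can be driven from Python when per-skill pass/fail results are needed programmatically. This sketch uses only the CLI flags shown in this README and relies on the documented exit-code convention (0 = passed, 1 = failed); `build_command` and `run_batch` are hypothetical helper names, and the script is assumed to run from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_command(skill_dir: Path, report_dir: Path) -> list:
    """Build the Phase 1 command line for one skill directory."""
    report_path = report_dir / f"{skill_dir.name}.json"
    return [
        sys.executable, "evaluate_skill.py", str(skill_dir),
        "--phase1-only", "--output", str(report_path),
    ]


def run_batch(skills_root: Path, report_dir: Path) -> dict:
    """Evaluate every subdirectory of skills_root; map skill name to pass/fail."""
    report_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for skill_dir in sorted(p for p in skills_root.iterdir() if p.is_dir()):
        completed = subprocess.run(build_command(skill_dir, report_dir), check=False)
        results[skill_dir.name] = completed.returncode == 0
    return results


# Example (not executed here): run_batch(Path("/path/to/skills"), Path("reports"))
command = build_command(Path("skills/my-skill"), Path("reports"))
print(command[1:])
```

Passing the arguments as a list avoids shell quoting problems with skill paths that contain spaces.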

CI/CD Integration

# Exit code 0 = passed, 1 = failed
python3 evaluate_skill.py /path/to/skill --phase1-only || exit 1

GitHub Actions:

- name: Evaluate Skill
  run: |
    pip install -r requirements.txt
    python3 evaluate_skill.py ./skills/my-skill --phase1-only --format json

Use Cases

Development

  • ✅ Pre-commit quality validation
  • ✅ Security scanning before publication
  • ✅ Interactive feedback during development

CI/CD Pipelines

  • ✅ Automated quality gates
  • ✅ Regression detection
  • ✅ Compliance enforcement

Skill Registries

  • ✅ Publication approval workflow
  • ✅ Featured listing eligibility
  • ✅ Security compliance verification

Security Audits

  • ✅ Vulnerability scanning
  • ✅ OWASP compliance reporting
  • ✅ Secret detection

Example Skill Evaluation

Input: /path/to/jira-comment-poster

Phase 1 Results:

Score: 91/100 (Grade A)
Static: 49/50
Security: 42/50
Duration: 0.01s
Findings: 3 scoreable, 4 advisory
Decision: APPROVED (featured eligible)

Phase 2 Results:

Score: 63/100 (Grade D)
Functional: 0/50 (no graders matched)
Safety: 53/100
Issues: 5 infinite loops, 1.77M tokens
Duration: 679s (11m 20s)
Cost: $0.29

Overall: Grade C (77/100) - CONDITIONAL


Performance

Metric       Phase 1          Phase 2          Full
Duration     <1s              30-600s          30-600s
Memory       <50 MB           <512 MB          <512 MB
LLM Calls    0                2 per eval case  2 per eval case
Determinism  100%             100%             100%
Throughput   200+ skills/sec  0.1 skills/sec   0.1 skills/sec

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Development Setup

# Clone repository
git clone https://github.com/joinalahmed/skilleval.git
cd skilleval

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Install in development mode
pip install -e .

# Run tests
pytest tests/

# Run linting
ruff check src/
mypy src/

Adding New Checks

Static Test:

# src/skilleval/scorers/static_scorer.py
def _st9_new_check(self) -> StaticTestResult:
    """ST-9: New static check."""
    # Implementation
    return StaticTestResult(...)

Security Check:

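# src/skilleval/scorers/security_scorer.py
from typing import List  # needed for the return annotation below

def _l1_19_new_check(self, content: str, file: str) -> List[SecurityFinding]:
    """L1-19: New security check."""
    findings = []
    # Pattern matching: append one SecurityFinding per match in `content`
    return findings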
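
For a fuller picture of what a Layer 1 pattern check might look like, here is a standalone sketch. The `SecurityFinding` dataclass below is a hypothetical stand-in (the project's real model lives in its own modules and its fields are not shown in this README), and the confidence values and the "EXAMPLE" heuristic are illustrative assumptions.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SecurityFinding:
    """Hypothetical stand-in for the project's finding model."""
    check_id: str
    severity: str
    confidence: float
    file: str
    line: int
    message: str


# AWS access key IDs begin with "AKIA" followed by 16 uppercase alphanumerics.
AWS_KEY_PATTERN = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def l1_19_aws_access_key(content: str, file: str) -> List[SecurityFinding]:
    """L1-19 (illustrative): flag hardcoded AWS access key IDs."""
    findings = []
    for line_number, line in enumerate(content.splitlines(), start=1):
        if AWS_KEY_PATTERN.search(line):
            # Lower confidence for documentation-style placeholders so the
            # finding falls into the advisory band instead of being scored.
            confidence = 0.4 if "EXAMPLE" in line else 0.9
            findings.append(SecurityFinding(
                check_id="L1-19",
                severity="CRITICAL",
                confidence=confidence,
                file=file,
                line=line_number,
                message="Possible hardcoded AWS access key ID",
            ))
    return findings


sample = 'key = "AKIAIOSFODNN7EXAMPLE"\nprint("hi")\n'
found = l1_19_aws_access_key(sample, "scripts/post.py")
print(len(found), found[0].line, found[0].confidence)  # 1 1 0.4
```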

Roadmap

✅ v1.0.0 (Current)

  • Phase 1: Static Tests + Security (production-ready)
  • CLI with JSON/text output
  • Dual-score reporting
  • 82+ security patterns
  • Confidence weighting

🔄 v1.1.0 (In Progress)

  • Phase 2: Runtime Effectiveness integration
  • Harness execution orchestration
  • End-to-end dual-score testing

📋 v2.0.0 (Future)

  • Layer 2 security (SkillSpector)
  • HTML report generation
  • Batch evaluation dashboard
  • MCP server integration

License

MIT License - see LICENSE for details.

Copyright (c) 2026 SkillEval Contributors


Status: ✅ Production Ready
Version: 1.0.0
Last Updated: 2026-06-25
