# CodeVerdict

**Where AI Code Stands Trial** - a comprehensive evaluation platform that turns code quality assessment into intelligent hill climbing through knowledge-graph-powered insights.

CodeVerdict is an enterprise-grade AI code evaluation platform that goes beyond traditional metrics to provide intelligent, actionable insights for model improvement. It transforms raw evaluation data into strategic improvement intelligence through:
- **Smart Triage**: 50/50 auto-manual evaluation split with LLM-as-judge
- **Multi-Dimensional Metrics**: Pass@k, code quality, security, and beyond
- **Knowledge Graph Intelligence**: Neo4j-powered pattern discovery and hill climbing
- **Continuous Improvement**: Automated intervention recommendations
- **Production Ready**: FastAPI, MLflow, and enterprise tooling
## Architecture

```
CodeVerdict Core
├── Evaluation Engine
│   ├── Auto Evaluator (LLM-as-Judge)
│   ├── Manual Evaluator (Argilla Integration)
│   └── Triage Engine (50/50 Smart Split)
├── Knowledge Graph
│   ├── Neo4j Pattern Storage
│   ├── MLflow-Neo4j Bridge
│   └── AI Agent Query Service
├── Model Registry
│   ├── MLflow Experiment Tracking
│   └── Enhanced Registry with KG Insights
└── API Layer
    ├── FastAPI REST API
    └── Real-time Dashboard
```
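The evaluation engine's output can be pictured as a multi-dimensional verdict record. Below is a minimal pure-Python sketch; field names are illustrative assumptions, while the actual models live in `data/models.py` as Pydantic models:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    """Illustrative verdict record; not CodeVerdict's actual schema."""
    model_id: str
    prompt_id: str
    passed_tests: int
    total_tests: int
    readability: int   # 1-5 score from the LLM judge
    efficiency: int    # 1-5
    security: int      # 1-5

    @property
    def test_coverage(self) -> float:
        # Fraction of test cases passed for this completion
        return self.passed_tests / self.total_tests

v = Verdict("codegen-2b", "fibonacci_recursive",
            passed_tests=9, total_tests=10,
            readability=4, efficiency=3, security=5)
print(v.test_coverage)  # 0.9
```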
## Project Structure

```
codeverdict/
├── api/
│   └── main.py                  # FastAPI application & endpoints
├── config/
│   └── settings.py              # Pydantic settings management
├── data/
│   └── models.py                # Pydantic data models
├── evaluation/
│   ├── auto_evaluator.py        # LLM-as-judge auto evaluation
│   ├── manual_evaluator.py      # Argilla manual evaluation setup
│   └── triage_engine.py         # Smart 50/50 triage logic
├── knowledge_graph/
│   ├── neo4j_service.py         # Neo4j connection & operations
│   ├── mlflow_neo4j_bridge.py   # MLflow to Neo4j bridge
│   └── agent_queries.py         # AI agent query service
├── models/
│   ├── registry.py              # MLflow model registry
│   └── enhanced_registry.py     # KG-enhanced registry
├── orchestration/
│   └── workflows.py             # Evaluation pipeline workflows
├── utils/
│   └── helpers.py               # Utility functions
├── tests/
│   ├── test_evaluation.py
│   ├── test_triage.py
│   └── test_knowledge_graph.py
├── requirements.txt
├── docker-compose.yml
├── .env.example
└── README.md
```

## Quick Start

```bash
# Clone the repository
git clone https://github.com/cklam12345/codeverdict.git
cd codeverdict

# Install dependencies
pip install -r requirements.txt

# Set up the environment
cp .env.example .env
# Edit .env with your API keys and configuration

# Start all services (MLflow, Neo4j, Argilla, FastAPI)
docker-compose up -d
```

```python
from codeverdict.api.main import app
from codeverdict.config.settings import settings

# The system automatically initializes with sample evaluation sets
# Visit http://localhost:8000/docs to explore the API
```

## Evaluation Metrics

CodeVerdict evaluates AI-generated code across multiple dimensions:
- Pass@k: Probability of at least one correct solution within k attempts
- G-Pass@k: Generalization-focused Pass@k with hidden tests
- Test Coverage: Percentage of test cases passed
- Edit Distance to Fix: Characters needed to fix incorrect code
- Readability Score: Code understandability (1-5)
- Efficiency Score: Algorithm performance (1-5)
- Security Score: Vulnerability detection (1-5)
- Style Adherence: Coding standards compliance (1-5)
- Time-to-Correct-Solution: Developer productivity metric
- Adoption Rate: AI suggestion acceptance rate
- Cost-per-Correct-Solution: Economic efficiency
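Two of the metrics above are easy to make concrete. The sketch below shows the standard unbiased Pass@k estimator (`1 - C(n-c, k) / C(n, k)` over n generations with c correct) and a plain Levenshtein distance for edit-distance-to-fix; this is illustrative, not CodeVerdict's internal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k samples
    drawn from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # not enough failing samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimal character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(pass_at_k(4, 2, 1))                      # 0.5
print(edit_distance("fib(n-1)", "fib(n-2)"))   # 1
```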
## Knowledge Graph Intelligence

Our Neo4j knowledge graph enables intelligent hill climbing:

```python
from codeverdict.knowledge_graph.agent_queries import HillClimbingAgentQueries

# Get AI-powered improvement recommendations
agent = HillClimbingAgentQueries(kg_service)
interventions = agent.find_high_roi_interventions("model-v2")

# Returns:
# [
#   {
#     "failure_type": "recursion_base_cases",
#     "failure_count": 45,
#     "fixability_score": 0.89,
#     "expected_lift": "+12% Pass@1"
#   }
# ]
```

```cypher
// Find similar failure patterns across models
MATCH (v:Verdict)-[sim:SIMILAR_FAILURE]-(other:Verdict)
WHERE v.overall_score < 0.7
RETURN v.status, COUNT(sim) AS pattern_frequency
ORDER BY pattern_frequency DESC

// Get the improvement trajectory
MATCH (model:Model)-[:EVALUATED_IN]->(run:EvaluationRun)
WITH run ORDER BY run.timestamp
RETURN run.timestamp, run.metrics.pass1
```

## Smart Triage

Our 50/50 auto-manual split intelligently routes evaluations:
```python
from codeverdict.evaluation.triage_engine import CodeVerdictTriageEngine

triage_engine = CodeVerdictTriageEngine(
    auto_eval_threshold=0.8,
    manual_sample_rate=0.5
)

auto_batch, manual_batch = triage_engine.triage_completions(completions)
print(f"Auto: {len(auto_batch)}, Manual: {len(manual_batch)}")
```

Routing rules:

- Security Audits: Always manual review
- High-Quality Code: Auto-approve (confidence > 0.8)
- Borderline Cases: Manual review + sampling
- Critical Failures: Auto-reject with detailed analysis
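These routing rules can be sketched as a small decision function; the function name, arguments, and the 0.3 rejection threshold are illustrative assumptions, not CodeVerdict's actual API:

```python
def route_completion(confidence: float,
                     security_sensitive: bool,
                     tests_failed: bool,
                     auto_eval_threshold: float = 0.8) -> str:
    """Illustrative triage decision following the rules listed above."""
    if security_sensitive:
        return "manual_review"   # security audits: always manual
    if tests_failed and confidence < 0.3:
        return "auto_reject"     # critical failures: auto-reject
    if not tests_failed and confidence > auto_eval_threshold:
        return "auto_approve"    # high-quality code above threshold
    return "manual_review"       # borderline cases go to humans

print(route_completion(0.95, False, False))  # auto_approve
print(route_completion(0.95, True, False))   # manual_review
print(route_completion(0.10, False, True))   # auto_reject
```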
## Experiment Tracking

Track every evaluation with comprehensive experiment tracking:

```python
from codeverdict.models.registry import CodeVerdictRegistry

registry = CodeVerdictRegistry(settings.mlflow_tracking_uri)

# Register evaluation results
registry.register_verdict(
    verdict=final_verdict,
    model_id="codegen-2b",
    prompt_id="fibonacci_recursive"
)
```

## API Usage

```bash
# Run an evaluation
curl -X POST "http://localhost:8000/evaluate" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "your-model-v1",
    "eval_set_name": "codeverdict_sample_evals"
  }'

# Fetch evaluation results
curl "http://localhost:8000/results/your-model-v1"

# Fetch knowledge-graph insights
curl "http://localhost:8000/kg/insights/your-model-v1"
```
```python
# Automated improvement recommendations
insights = agent_queries.find_high_roi_interventions("model-v2")
training_spec = insights.generate_training_curriculum()

# Group similar failures for targeted fixes
clusters = agent_queries.find_similar_failure_clusters([
    "recursion_errors", "off_by_one"
])

# Predict improvement before training
predicted_lift = kg_service.predict_intervention_roi(
    intervention="synthetic_tree_data",
    historical_success_rate=0.85
)
```
## Deployment

```yaml
# docker-compose.yml
version: '3.8'
services:
  fastapi:
    build: .
    ports: ["8000:8000"]
    environment:
      - NEO4J_URI=bolt://neo4j:7687
      - MLFLOW_TRACKING_URI=http://mlflow:5000
  mlflow:
    image: mlflow/mlflow:latest
    ports: ["5000:5000"]
  neo4j:
    image: neo4j:5.0
    ports: ["7687:7687", "7474:7474"]
  argilla:
    image: argilla/argilla-server:latest
    ports: ["6900:6900"]
```

```ini
# .env
MLFLOW_TRACKING_URI=sqlite:///mlflow.db
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=codeverdict
ARGILLA_API_URL=http://localhost:6900
OPENAI_API_KEY=your-key-here
```

## Sample Output

After running CodeVerdict, you get:
```json
{
  "model_id": "codegen-2b",
  "summary": {
    "total_prompts": 150,
    "auto_evaluated": 75,
    "manual_reviewed": 75,
    "average_score": 0.82,
    "verdict_distribution": {
      "auto_approved": 60,
      "manual_approved": 68,
      "rejected": 22
    }
  },
  "improvement_insights": {
    "high_roi_interventions": [
      {
        "target": "recursion_base_cases",
        "expected_lift": "+12%",
        "effort_required": "low",
        "historical_success_rate": 0.89
      }
    ],
    "predicted_next_score": 0.89
  }
}
```

## Why CodeVerdict?

CodeVerdict accelerates research cycles:
- **Traditional**: Evaluation → Aggregate Scores → Guess Improvements → Train Blindly (4-6 weeks)
- **With CodeVerdict**: Evaluation → Pattern Analysis → Targeted Interventions → Measured Improvement (1-2 weeks), guided by knowledge-graph wisdom
## Contributing

We welcome contributions! Please see our Contributing Guide for details.

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

## Acknowledgments

- MLflow for experiment tracking
- Neo4j for knowledge graph capabilities
- Argilla for human-in-the-loop evaluation
- FastAPI for high-performance API framework
- Phoenix for ML observability
CodeVerdict turns evaluation data into improvement intelligence. Stop guessing what to fix next - let the knowledge graph guide your hill climbing.
Get started:

```bash
docker-compose up -d
curl -X POST "http://localhost:8000/evaluate" -H "Content-Type: application/json" -d '{"model_id": "your-model"}'
```

Visit:

- API Docs: http://localhost:8000/docs
- MLflow UI: http://localhost:5000
- Argilla: http://localhost:6900
- Neo4j Browser: http://localhost:7474

Transform your AI evaluation from metrics to intelligence!