| title | Self-Healing Code Agent |
|---|---|
| sdk | gradio |
| sdk_version | 6.6.0 |
| app_file | app.py |
| pinned | false |
| license | mit |
| short_description | Autonomous agent that self-heals Python code errors. |

Rohan Jain, MS Machine Learning, University of Maryland

MS ML student at UMD with a background in data science and analytics, transitioning into applied LLM systems engineering. Focused on building AI systems that are reliable and observable in real execution environments, not just accurate in notebooks. This project was built to explore what it takes to make an autonomous agent self-correct in a live Python execution environment, moving well beyond one-shot prompting.

| Contact | Link |
|---|---|
| GitHub | github.com/Rohanjain2312 |
| HuggingFace | huggingface.co/rohanjain2312 |
| LinkedIn | linkedin.com/in/jaroh23 |
| Email | jaroh23@umd.edu |

An autonomous agent that generates Python code, adversarially tests it with edge cases, diagnoses failures through structured root-cause analysis, and iteratively repairs the solution, all without human input. The core problem: LLMs produce incorrect code on the first attempt more often than not. This system treats that as a solvable engineering problem by wrapping the LLM in a structured, self-correcting feedback loop.

| Run it now | |
|---|---|
| HF Spaces: no setup required | Runs on CPU, expect 30-90s per agent step |
| Google Colab: GPU | T4 GPU, public gradio.live link via share=True |

Live run: the agent generated a Python solution, tested it adversarially, diagnosed a failure, and applied a targeted repair, all autonomously. Execution Timeline (bottom left) and Learning Log (right) update in real time.

⚠️ Deploying to HuggingFace Spaces? See `docs/deployment-issues.md` for a full log of known failure modes and fixes encountered during this build.

- Python 3.11+
- `ANTHROPIC_API_KEY`: required for all non-generator roles (QA, debugger, critic, memory summarizer)
- Ollama running locally (`ollama serve`): optional, used only for the generator role

pip install -r requirements.txt

# Option 1: Ollama generator + Claude for smart roles (recommended locally)
ollama pull llama3.2:3b
ANTHROPIC_API_KEY=sk-ant-... python app.py

# Option 2: All roles use Claude (no Ollama needed; repair loop triggers less often)
LLM_PROVIDER=anthropic ANTHROPIC_API_KEY=sk-ant-... python app.py

pip install transformers torch accelerate
# LLM_PROVIDER=huggingface is REQUIRED to route the generator role to the local 3B model.
# Without it, the generator silently falls back to Claude (repair loop rarely triggers).
LLM_PROVIDER=huggingface ANTHROPIC_API_KEY=sk-ant-... python app.py

LLM_PROVIDER=mock python app.py

# Full benchmark (requires real LLM provider)
python -m evaluation.run_benchmark --provider ollama --max-iterations 4
# Single task
python -m evaluation.run_benchmark --task-ids interval_merge_001 --provider ollama

# All tests (uses mock provider, no models needed)
LLM_PROVIDER=mock pytest
# Single test file
LLM_PROVIDER=mock pytest tests/test_sandbox.py -v

This system has three architectural layers with meaningfully different properties:

The Generator, QA, Executor, and Memory Summarizer nodes run in a fixed order. They are LLM-augmented steps, not autonomous agents: each receives a structured prompt, calls the LLM once, validates the output against a JSON schema, and passes results downstream. The LangGraph state machine handles routing.
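
To make the "prompt, single LLM call, validate, pass downstream" shape concrete, here is a minimal sketch of one pipeline step. It is not the project's code: `run_pipeline_node`, `fake_llm`, and the `code` / `explanation` fields are hypothetical names, and a simple type check stands in for full JSON-schema validation.

```python
import json
from typing import Callable


def run_pipeline_node(
    prompt_template: str,
    state: dict,
    call_llm: Callable[[str], str],
    required_fields: dict[str, type],
) -> dict:
    """Render a prompt, call the LLM exactly once, validate, return new state."""
    prompt = prompt_template.format(**state)
    raw = call_llm(prompt)
    payload = json.loads(raw)  # raises ValueError on malformed JSON
    for name, expected_type in required_fields.items():
        if not isinstance(payload.get(name), expected_type):
            raise ValueError(f"field {name!r} missing or not {expected_type.__name__}")
    # Downstream nodes see the original state plus the validated fields.
    return {**state, **payload}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real provider so the sketch runs offline."""
    return json.dumps({"code": "def add(a, b):\n    return a + b", "explanation": "adds"})


result = run_pipeline_node(
    "Write a Python function for: {task}",
    {"task": "add two numbers"},
    fake_llm,
    {"code": str, "explanation": str},
)
print(result["code"])
```

Because the step is a plain function of the state, a failed validation surfaces as an exception at the node boundary rather than as corrupted state further down the pipeline.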
The Debugger is the one genuinely agentic component. It runs a ReAct loop: think → use tool → observe → repeat → conclude. It can invoke three tools before issuing its final diagnosis:

- `run_snippet`: execute a Python snippet to test an edge-case hypothesis
- `inspect_function`: parse AST to verify function signatures and docstrings
- `diff_iterations`: compare code across repair attempts to track convergence

This is where actual autonomous multi-step reasoning happens: the LLM decides which tools to use, in what order, and when to stop investigating.
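
The loop described above can be sketched with the standard library alone. This is an illustration under assumptions, not the repository's implementation: the tool bodies, the action dictionary format, `react_loop`, and `scripted_policy` (a stand-in for the LLM's decisions) are all hypothetical, and `run_snippet` is left out to keep the sketch short.

```python
import ast
import difflib
from typing import Callable


def inspect_function(source: str) -> str:
    """Summarize top-level function signatures and docstrings via the AST."""
    summaries = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(arg.arg for arg in node.args.args)
            doc = ast.get_docstring(node) or "<no docstring>"
            summaries.append(f"{node.name}({args}): {doc}")
    return "\n".join(summaries)


def diff_iterations(previous: str, current: str) -> str:
    """Unified diff between two repair attempts."""
    return "".join(
        difflib.unified_diff(
            previous.splitlines(keepends=True),
            current.splitlines(keepends=True),
            fromfile="previous",
            tofile="current",
        )
    )


def react_loop(decide: Callable[[list], dict], tools: dict, max_steps: int = 5) -> dict:
    """Think -> use tool -> observe, repeated until the policy concludes."""
    history = []
    for _ in range(max_steps):
        action = decide(history)
        if action["type"] == "conclude":
            return {"diagnosis": action["diagnosis"], "steps": history}
        observation = tools[action["tool"]](**action["args"])
        history.append({"tool": action["tool"], "observation": observation})
    return {"diagnosis": "inconclusive: step budget exhausted", "steps": history}


V1 = 'def merge(intervals):\n    """Merge overlapping intervals."""\n    return intervals\n'
V2 = 'def merge(intervals):\n    """Merge overlapping intervals."""\n    return sorted(intervals)\n'


def scripted_policy(history: list) -> dict:
    """Fixed script standing in for the LLM, which would choose tools itself."""
    if len(history) == 0:
        return {"type": "tool", "tool": "inspect_function", "args": {"source": V2}}
    if len(history) == 1:
        return {"type": "tool", "tool": "diff_iterations", "args": {"previous": V1, "current": V2}}
    return {"type": "conclude", "diagnosis": "repair only sorts; overlaps are never merged"}


result = react_loop(
    scripted_policy,
    {"inspect_function": inspect_function, "diff_iterations": diff_iterations},
)
print(result["diagnosis"])
```

The step budget (`max_steps`) is an assumption of this sketch; it keeps a policy that never concludes from looping indefinitely.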
Configurable via AgentConfig.autonomy_level:
- `review_repairs` (default): pause before each repair for human approval via `interrupt()`
- `full_auto`: no interrupts; fully autonomous repair loop (used by the HF Spaces deployment, set in `app.py`)
- `review_all`: pause before generation AND before each repair

The HF Spaces entry point (`app.py` at the repo root) explicitly overrides this to `full_auto` because LangGraph's `interrupt()` pause/resume flow can desynchronize with Gradio's streaming generator on hosted deployments.
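
As a rough illustration of how the three levels map to pause points, the sketch below encodes them in a small decision function. `AgentConfigSketch` and `should_pause` are hypothetical stand-ins; the real `AgentConfig` carries more fields and its exact shape is not shown in this README.

```python
from dataclasses import dataclass
from typing import Literal

AutonomyLevel = Literal["review_repairs", "full_auto", "review_all"]
Stage = Literal["generation", "repair"]


@dataclass
class AgentConfigSketch:
    """Hypothetical stand-in holding only the autonomy setting."""
    autonomy_level: AutonomyLevel = "review_repairs"


def should_pause(config: AgentConfigSketch, stage: Stage) -> bool:
    """Return True when a human should approve before this stage runs."""
    if config.autonomy_level == "full_auto":
        return False
    if config.autonomy_level == "review_all":
        return True
    # review_repairs: only repairs need approval.
    return stage == "repair"


default_config = AgentConfigSketch()
hosted_config = AgentConfigSketch(autonomy_level="full_auto")

print(should_pause(default_config, "generation"))  # False
print(should_pause(default_config, "repair"))      # True
print(should_pause(hosted_config, "repair"))       # False
```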

[generate_spec_tests] ← once (spec-blind oracle tests, if enabled)
↓
generate_solution ← generates initial code
↓
create_adversarial_tests ← QA hunts for edge cases in the code
↓
execute_solution ← runs BOTH spec + adversarial tests in sandbox
↓ (pass) → [critic_review] ← sanity-checks correctness (if enabled)
    ↓ (approve) → END
    ↓ (reject) → diagnose_failure
↓ (fail) → diagnose_failure ← ReAct loop with tools
    ↓
    update_learning_log
    ↓
    [review_repair] ← HITL interrupt() (if not full_auto)
    ↓
    increment_iteration
    ↓
    generate_solution (or fan_out_repairs if parallel_strategies)
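
The branch points in this flow can be expressed as small pure functions that map state to the next node name. The node names below come from the flow itself; the state keys (`tests_passed`, `critic_enabled`, `critic_approved`, `parallel_strategies`) are assumptions made for this sketch. In LangGraph, functions of this shape are typically registered with `add_conditional_edges`.

```python
def route_after_execution(state: dict) -> str:
    """Choose the node that follows execute_solution."""
    if not state["tests_passed"]:
        return "diagnose_failure"
    return "critic_review" if state.get("critic_enabled", False) else "END"


def route_after_critic(state: dict) -> str:
    """A rejected solution is sent back for diagnosis even though tests passed."""
    return "END" if state["critic_approved"] else "diagnose_failure"


def route_after_increment(state: dict) -> str:
    """Either regenerate a single solution or fan out parallel repair strategies."""
    return "fan_out_repairs" if state.get("parallel_strategies") else "generate_solution"


print(route_after_execution({"tests_passed": False}))
print(route_after_execution({"tests_passed": True, "critic_enabled": True}))
print(route_after_critic({"critic_approved": False}))
print(route_after_increment({"parallel_strategies": True}))
```

Keeping routing separate from node logic means each branch can be unit-tested without running a model.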
| Role | Type | Description | Prompt |
|---|---|---|---|
| Generator | Pipeline node | Writes initial code; applies targeted repairs guided by diagnosis | generator.yaml |
| QA Adversarial | Pipeline node | Generates hostile edge-case tests designed to break the solution | qa_adversarial.yaml |
| Debugger | ReAct agent | Root-cause analysis with tool use; runs think/act/observe loop | debugger.yaml |
| Memory Summarizer | Pipeline node | Compresses iteration history into ≤5 bullet lessons | memory_summarizer.yaml |
| Critic | Pipeline node | Sanity-checks passing solutions for correctness issues the tests missed | critic.yaml |
| Concept | Implementation |
|---|---|
| ReAct agent loop | Debugger runs think → use tool → observe → repeat before issuing diagnosis |
| Dual-oracle testing | Spec-blind tests (generated before code exists) + adversarial tests (generated after); both must pass |
| Human-in-the-loop | interrupt() pauses the graph for human review; AgentConfig.autonomy_level controls when |
| Parallel repair strategies | Fan-out via LangGraph Send(): 3 strategies run concurrently, tournament selection picks winner |
| Agent self-reflection | Critic node reviews passing solutions for correctness issues the test suite missed |
| Time-travel debugging | Checkpointer stores all states; fork_from_iteration() rewinds and replays with modified state |
| Confidence-aware routing | Low-confidence diagnoses route to blind retry instead of targeted repair |
| Structured outputs + schema validation | Every LLM call validated against a typed JSON schema; coercion + regex salvage handle malformed output |
| Prompt engineering + versioning | YAML prompt files per role, git-versioned, hot-reloadable, decoupled from agent code |
| Token / context management | Rolling memory summarizer (max 5 lessons) + token-aware truncation with re-render |
| Provider-agnostic inference | Unified LLM router: Ollama → HuggingFace → Mock. Per-role model overrides supported |
| Observability | LangSmith tracing (opt-in), per-node metrics, degraded-node tracking, event stream |
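
Two rows of the table above, regex salvage of malformed output and confidence-aware routing, can be sketched together. Everything here is illustrative: `salvage_json`, `coerce_diagnosis`, `choose_repair_mode`, the `root_cause` / `confidence` fields, and the 0.5 threshold are assumptions, not the project's actual schema or values.

```python
import json
import re


def salvage_json(raw: str) -> dict:
    """Parse a JSON object from model output, tolerating surrounding chatter."""
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass
    # Fallback: grab the outermost {...} span and try again.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))


def coerce_diagnosis(payload: dict) -> dict:
    """Coerce loosely typed fields (e.g. a stringified number) to the expected types."""
    return {
        "root_cause": str(payload["root_cause"]),
        "confidence": float(payload["confidence"]),
    }


def choose_repair_mode(diagnosis: dict, threshold: float = 0.5) -> str:
    """Low-confidence diagnoses fall back to a blind retry (threshold is assumed)."""
    return "targeted_repair" if diagnosis["confidence"] >= threshold else "blind_retry"


raw_output = (
    'Here is my diagnosis: {"root_cause": "empty input not handled", '
    '"confidence": "0.35"} Hope that helps!'
)
diagnosis = coerce_diagnosis(salvage_json(raw_output))
mode = choose_repair_mode(diagnosis)
print(diagnosis, mode)
```

The greedy pattern is deliberately simple; it will fail on output that contains several unrelated brace-delimited spans, which is one reason salvage is a last resort behind strict parsing.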

Run conditions: `llama3` via Ollama, local CPU, max 4 iterations per task. 8 tasks across 6 categories. Reference-validated results use held-out ground-truth test suites not seen by the agent.

| Metric | Self-Reported | Reference-Validated |
|---|---|---|
| Tasks evaluated | 8 | 8 |
| First-pass success | 3 / 8 (37.5%) | - |
| Healed after repair | 4 / 5 initially-failing (80%) | - |
| Final success rate | 7 / 8 (87.5%) | run with --validate-reference to generate |
| Avg iterations per task | 1.875 | - |
| Unresolved | 1 (word frequency with complex tie-breaking) | - |
Self-reported: agent's own generated tests pass. Reference-validated: held-out ground-truth assertions pass.
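
The self-reported rates follow directly from three counts in the table. The short calculation below reproduces them as exact values; it uses only the numbers already stated above.

```python
tasks = 8
first_pass = 3  # solved on the first attempt
healed = 4      # initially failing, fixed by the repair loop

initially_failing = tasks - first_pass  # 5
final_success = first_pass + healed     # 7

first_pass_rate = first_pass / tasks
heal_rate = healed / initially_failing
final_rate = final_success / tasks

print(f"first-pass: {first_pass_rate:.1%}")
print(f"healed:     {heal_rate:.1%}")
print(f"final:      {final_rate:.1%}")
```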
| Category | Success |
|---|---|
| Interval merging | 100% |
| Data normalization | 100% |
| Log processing | 100% |
| Data transformation | 100% |
| Boundary conditions | 100% |
| Text processing | 50% |
| Limitation | Why It Happens | How to Overcome |
|---|---|---|
| Slow inference on HF Spaces (30-90s/step) | Free tier = CPU only, no GPU | Upgrade to HF Spaces Pro (A100) or swap to an API-hosted model via the router |
| Schema instability on small models | 3B models frequently truncate or mis-format JSON; JSON-encoding Python source roughly doubles character count under tight token limits | Use 8B+ model, or an API provider with native structured output support |
| No cross-session memory | The learning log resets on every new task | Add ChromaDB vector store (scaffolded in agent/memory_store.py, enable via AgentConfig.enable_cross_session_memory) |
| Single-file execution sandbox | Subprocess executor runs one file and cannot handle solutions spanning multiple modules | Extend sandbox to write a temp package directory with __init__.py |
| Layer | Technology |
|---|---|
| Agent orchestration | LangGraph 0.3+ (async state machine, Send() fan-out, interrupt() HITL, checkpointing) |
| LLM inference | Llama-3.2-3B-Instruct (HF Spaces) · Llama-3.1-8B (Colab) · Ollama (local) |
| UI & deployment | Gradio 6.6 · HuggingFace Spaces |
| Prompt management | YAML templates per agent role, git-versioned, hot-reloadable |
| Schema validation | jsonschema + custom coercion + regex salvage fallback |
| Execution sandbox | Python subprocess with rlimit resource limits and structured output markers |
| Async runtime | asyncio · AnyIO · LangGraph async nodes · async event bus (pub/sub) |
| Testing | pytest · pytest-asyncio · mock provider (no GPU required) |
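
The "subprocess with rlimit resource limits and structured output markers" row can be illustrated with a small POSIX-only sketch. It is not the project's sandbox: the marker strings, the 2-second CPU limit, and the convention that the snippet assigns a variable named `result` are all assumptions made for this example.

```python
import resource
import subprocess
import sys
import tempfile
from pathlib import Path

RESULT_START = "<<<RESULT_START>>>"
RESULT_END = "<<<RESULT_END>>>"


def _limit_resources() -> None:
    """Runs in the child before exec: cap CPU time at 2 seconds (assumed value)."""
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))


def run_sandboxed(code: str, timeout: float = 10.0) -> dict:
    """Run a single-file snippet in a child interpreter and extract its result."""
    # Markers let the parent separate the result from any other stdout noise.
    wrapped = (
        f"{code}\n"
        f"print({RESULT_START!r})\n"
        f"print(repr(result))\n"
        f"print({RESULT_END!r})\n"
    )
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "solution.py"
        script.write_text(wrapped)
        proc = subprocess.run(
            [sys.executable, "-I", str(script)],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
            cwd=workdir,
            preexec_fn=_limit_resources,
        )
    payload = None
    if RESULT_START in proc.stdout and RESULT_END in proc.stdout:
        payload = proc.stdout.split(RESULT_START, 1)[1].split(RESULT_END, 1)[0].strip()
    return {"returncode": proc.returncode, "result": payload, "stderr": proc.stderr}


ok = run_sandboxed("result = sorted([3, 1, 2])")
failed = run_sandboxed("raise ValueError('boom')")
print(ok["result"], failed["returncode"])
```

A CPU limit and a timeout bound runaway code but do not restrict filesystem or network access, so a sketch like this is a resource guard rather than a security boundary.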

The agent architecture, workflow design, LangGraph state machine topology, YAML prompt schemas, and all key engineering decisions were designed and authored by Rohan Jain. Claude Code was used as an implementation accelerator to handle repetitive boilerplate, file scaffolding, and iterative debugging, similar to how a senior engineer delegates implementation tasks to Copilot or a junior developer while retaining full design ownership.

All architectural choices, agent interaction patterns, structured output schemas, and the self-healing repair logic reflect the author's original engineering judgment.

agent/          LangGraph state machine and node implementations
  nodes/        Individual agent nodes (generate, QA, execute, debug, critic, etc.)
  tools.py      Debugger tools: run_snippet, inspect_function, diff_iterations
  config.py     AgentConfig: all feature flags in one place
  metrics.py    Per-node and per-run metrics tracking
framework/      Async event bus and streaming infrastructure
llm/            Unified LLM router, providers, prompt loading
  providers/    Ollama, HuggingFace, Mock implementations
sandbox/        Safe Python subprocess execution environment
prompts/        YAML prompt templates per agent role
evaluation/     Benchmark harness, metrics, and precomputed results
demo/           Gradio application for HuggingFace Spaces
tests/          Pytest test suite (mock provider, no GPU needed)

| Variable | Required? | Default | Description |
|---|---|---|---|
| ANTHROPIC_API_KEY | Required | - | Anthropic API key for Claude. All non-generator roles (QA, debugger, critic, memory summarizer) use Claude. Without this key the agent falls back to Mock. |
| LLM_PROVIDER | Required for HF Spaces | auto | One of huggingface, ollama, anthropic, mock. Set to huggingface on HF Spaces to route the generator to the local 3B model. |
| OLLAMA_GENERATOR_MODEL | Optional | llama3.2:3b | Ollama model used for the generator role when Ollama is reachable. |
| OLLAMA_BASE_URL | Optional | http://localhost:11434 | Ollama server URL |
| HF_MODEL | Optional | meta-llama/Llama-3.2-3B-Instruct | HuggingFace model ID for the generator role |
| HF_TOKEN | Optional | - | HuggingFace token for cross-session lesson persistence to a HF Dataset |
| USE_4BIT | Optional | unset | Enable 4-bit quantization for the HF generator (requires bitsandbytes) |
| LANGCHAIN_TRACING_V2 | Optional | unset | Enable LangSmith tracing |
| LANGCHAIN_API_KEY | Optional | unset | LangSmith API key |
| LANGCHAIN_PROJECT | Optional | self-healing-agent | LangSmith project name |
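
For readers wiring these variables into their own scripts, the sketch below shows one way to read them with the defaults listed above. `load_settings` and the `has_anthropic_key` field are hypothetical; the project's own configuration loading may differ.

```python
import os
from typing import Mapping

# Defaults copied from the environment variable table.
DEFAULTS = {
    "LLM_PROVIDER": "auto",
    "OLLAMA_GENERATOR_MODEL": "llama3.2:3b",
    "OLLAMA_BASE_URL": "http://localhost:11434",
    "HF_MODEL": "meta-llama/Llama-3.2-3B-Instruct",
    "LANGCHAIN_PROJECT": "self-healing-agent",
}


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve each variable from the environment, falling back to its default."""
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    # Record only whether the key is present, never the key itself.
    settings["has_anthropic_key"] = bool(env.get("ANTHROPIC_API_KEY"))
    return settings


settings = load_settings({"LLM_PROVIDER": "mock"})
print(settings["LLM_PROVIDER"], settings["OLLAMA_BASE_URL"])
```

Passing the environment as a mapping keeps the function testable without mutating `os.environ`.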
