Find real bugs in Python codebases — automatically.
A3 combines non-LLM static analysis with agentic LLM triage for 99%+ accuracy:
- Static analysis first: Uses bytecode analysis, barrier-certificate proofs, and Z3-backed symbolic execution to automatically prove 99% of candidates as false positives
- Agentic LLM triage second: An LLM agent explores the codebase — reading files, searching for guard patterns, checking callers, inspecting tests — then classifies the remaining 1% as TP or FP
No overwhelming noise. No alert fatigue. Just real bugs that matter.
```
pip install a3-python
```

Requires Python ≥ 3.11. The only core dependency is `z3-solver` (installed automatically).
For CI features (GitHub Actions workflows, LLM triage, SARIF output):
```
pip install a3-python[ci]
```

```
a3 scan . --output-sarif results.sarif
```

That's it. A3 runs a 7-step pipeline — including call graph construction, crash summary computation, guard detection, barrier-certificate proofs, and DSE confirmation — and reports the surviving true-positive candidates.
```
a3 scan . --triage
```

That's it — one command. A3 scans, writes SARIF, then auto-detects your API key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, or `GITHUB_TOKEN`) and launches an agentic triage where the LLM explores the codebase using tools (reading files, searching for guards, checking callers and tests) before classifying each finding. Only confirmed true positives survive.
You can also specify the provider explicitly:
```
a3 scan . --triage openai --output-sarif results.sarif
a3 scan . --triage github      # uses GITHUB_TOKEN, free in CI
a3 scan . --triage anthropic   # uses ANTHROPIC_API_KEY
```

Or run scan and triage as separate steps:

```
a3 scan . --output-sarif results.sarif
a3 triage --sarif results.sarif --provider openai --agentic --verbose
```

```
# Symbolic execution portfolio (kitchensink) is enabled by default
a3 scan /path/to/project \
  --interprocedural \
  --dse-verify \
  --deduplicate \
  --min-confidence 0.3

# To disable portfolio analysis (not recommended):
a3 scan /path/to/project --no-kitchensink
```

```
a3 scan /path/to/project --output-sarif results.sarif
```

Upload the SARIF file to GitHub's Code Scanning dashboard, or use the built-in CI integration (see below).
A3 ships with GitHub Actions workflows that continuously scan every push and every PR using a two-phase approach:
- Non-LLM static analysis scans the repo and automatically proves 99% as false positives
- Agentic LLM triage investigates the remaining 1% — the LLM reads source files, searches for guard patterns, checks callers and tests, then classifies each finding — zero API keys needed
Every GitHub Actions runner already has a GITHUB_TOKEN, which gives access to GitHub Models. That's all the agentic triage needs.
```
cd your-repo/
pip install a3-python[ci]
a3 init . --copilot
git add .github/ .a3.yml .a3-baseline.json
git commit -m "ci: add a3 static analysis"
git push
```

That's it. Every push to main/master and every PR that touches Python files will now be scanned, triaged by an agentic LLM, and checked against the baseline. Results appear in GitHub's Code Scanning dashboard (Security → Code scanning alerts).
| File | What it does |
|---|---|
| `.github/workflows/a3-pr-scan.yml` | On every push & PR: scans the repo → agentic LLM investigates each finding (reads files, searches patterns, checks callers) → blocks if new bugs found → uploads SARIF |
| `.github/workflows/a3-scheduled-scan.yml` | Weekly (Monday 6 AM UTC): full-repo scan → agentic triage → auto-files GitHub Issues for new TPs → updates baseline |
| `.a3.yml` | Analysis configuration (what to scan, confidence thresholds, etc.) |
| `.a3-baseline.json` | Known-findings baseline for the ratchet (starts empty) |
```
push to main/master — or — PR opened (touches .py files)
 │
 ├─ 1. Non-LLM static analysis scans the repo
 │       • Bytecode analysis + Z3 symbolic execution
 │       • Automatically proves 99% as false positives
 │       • Outputs SARIF with remaining 1% of findings
 │
 ├─ 2. Agentic LLM triage (multi-turn tool-use)
 │       • For each surviving finding, an LLM agent:
 │           - Reads the flagged function's source code
 │           - Searches for guard checks, callers, tests
 │           - Follows imports and explores related files
 │           - Calls 'classify' with verdict + rationale
 │       • Uses GITHUB_TOKEN → GitHub Models (zero config)
 │
 ├─ 3. Baseline ratchet check
 │       new bugs not in baseline → ❌ CI fails
 │       all bugs already known   → ✅ CI passes
 │
 └─ 4. SARIF uploaded to GitHub Code Scanning dashboard
         findings appear under Security → Code scanning alerts
```
When you run `a3 init . --copilot`, it creates `.github/workflows/a3-pr-scan.yml`. Here are the key steps:
```yaml
# .github/workflows/a3-pr-scan.yml
# Triggers on: push to main/master, all PRs touching .py files

- name: Get changed files
  id: changed
  run: |
    git diff --name-only --diff-filter=ACMR "$BASE"...HEAD -- '*.py' > changed_files.txt

- name: Run a3
  run: |
    a3 scan . \
      --output-sarif a3-results.sarif

- name: Agentic triage                            # ← the magic step
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}     # ← already exists, no setup
  run: |
    a3 triage \
      --sarif a3-results.sarif \
      --output-sarif a3-triaged.sarif \
      --repo-root . \
      --provider github \
      --model gpt-4o \
      --agentic \
      --verbose
    mv a3-triaged.sarif a3-results.sarif

- name: Check baseline
  run: a3 baseline diff --sarif a3-results.sarif

- name: Upload SARIF
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: a3-results.sarif
```

The agentic triage step gives the LLM tools to explore the repo — `read_file`, `search_codebase`, `get_function_source`, `get_imports`, `list_directory` — so it can check callers, find tests, verify guards, and understand context before classifying each finding.
The baseline file records all accepted findings. On each PR:
- New findings not in baseline → CI fails. The author must fix the bug or explicitly accept it.
- Findings that disappear → auto-pruned. The codebase is getting healthier.
- Pre-existing issues → ignored. You're never blocked on legacy debt.
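The ratchet is essentially set arithmetic over finding fingerprints. Here is a minimal, illustrative sketch of that logic; the real implementation lives in `a3_python/ci/baseline.py`, and the fingerprint format shown here is hypothetical:

```python
# Hypothetical sketch of the baseline ratchet. Fingerprints of the form
# "BUG_TYPE:file:line" are illustrative, not A3's actual baseline format.
def baseline_diff(baseline_findings, current_findings):
    """Partition current findings into new (fails CI) and prune fixed ones."""
    known = set(baseline_findings)
    current = set(current_findings)
    # Findings not in the baseline are new: the PR author must fix or accept them.
    new = [f for f in current_findings if f not in known]
    # Baseline entries that no longer appear are auto-pruned (codebase got healthier).
    pruned = [f for f in baseline_findings if f not in current]
    return new, pruned

baseline = ["DIV_ZERO:utils.py:42", "NULL_PTR:auth.py:13"]
current = ["NULL_PTR:auth.py:13", "BOUNDS:csv.py:76"]

new, pruned = baseline_diff(baseline, current)
print(new)    # → ['BOUNDS:csv.py:76']     (CI fails: one new finding)
print(pruned) # → ['DIV_ZERO:utils.py:42'] (fixed: removed from baseline)
```

Note that `NULL_PTR:auth.py:13` appears in neither list: pre-existing findings are simply ignored, which is why legacy debt never blocks a PR.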
```
# Check for new findings (exits 1 if any are new)
a3 baseline diff --sarif results.sarif

# Accept current findings into baseline
a3 baseline accept --sarif results.sarif
```

If you prefer Claude or GPT-5 via your own API key instead of GitHub Models:
```
# Anthropic (Claude) — agentic by default
export ANTHROPIC_API_KEY=sk-...
a3 triage --sarif results.sarif --output-sarif triaged.sarif --agentic

# OpenAI (GPT-5)
a3 triage --sarif results.sarif --provider openai --model gpt-5 --agentic

# GitHub Models (via GITHUB_TOKEN — free in CI)
a3 triage --sarif results.sarif --provider github --agentic
```

For third-party providers in GitHub Actions, add the API key as a repository secret:
```yaml
- name: LLM triage (Anthropic)
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: a3 triage --sarif a3-results.sarif --output-sarif a3-triaged.sarif --agentic
```

Create a `.a3.yml` in your repo root (or use `a3 init` to generate one):
```yaml
analysis:
  interprocedural: true
  # kitchensink: true by default (symbolic execution portfolio)
  dse-verify: true
  min-confidence: 0.3
  deduplicate: true

ci:
  fail-on-new-bugs: true
  baseline-file: .a3-baseline.json
  llm-triage: true        # enabled by default with --copilot
  llm-provider: github    # uses GITHUB_TOKEN — no extra API keys
  llm-model: gpt-5
  sarif-upload: true

scan:
  exclude:
    - "tests/**"
    - "docs/**"
    - "**/test_*.py"
```

When a config file is present, `a3 scan` reads it automatically — no flags needed.
Run static analysis on a file or project.
```
a3 scan <target> [options]
a3 <target> [options]       # legacy syntax, same behavior
```
Analysis flags:
| Flag | Description |
|---|---|
| `--interprocedural` | Cross-function analysis with call graph and summaries |
| `--no-kitchensink` | Disable symbolic execution portfolio (enabled by default) |
| `--dse-verify` | Verify bugs with Z3-backed symbolic execution |
| `--deduplicate` | Deduplicate findings by type + location |
| `--min-confidence N` | Filter by confidence score (0.0–1.0, default: 0.7) |
| `--no-intent-filter` | Report all bugs regardless of intent classification |
| `--functions` | Treat each function as a tainted entry point (security mode) |
| `--all-functions` | Analyze every function as an entry point |
| `--context-depth N` | k-CFA context sensitivity (0, 1, 2, …) |
| `--check-termination` | Detect non-terminating loops |
| `--synthesize-invariants` | Generate inductive loop invariants |
Output flags:
| Flag | Description |
|---|---|
| `--output-sarif PATH` | Write SARIF 2.1.0 JSON |
| `--triage [PROVIDER]` | Run agentic triage after scan (auto-detects API key, or specify: `openai`, `anthropic`, `github`) |
| `--triage-model MODEL` | LLM model for integrated triage (default: provider-appropriate) |
| `--save-results PATH` | Write results as pickle (default: `results/<name>_results.pkl`) |
| `--verbose` | Detailed step-by-step output |
| `--config PATH` | Path to `.a3.yml` |
Exit codes: 0 = no bugs, 1 = bugs found, 2 = unknown, 3 = error.
Bootstrap a repository with CI workflows and config.
```
a3 init [repo_path]     # default: current directory
a3 init . --copilot     # enable Copilot triage (recommended)
a3 init . --llm-triage  # enable triage with Anthropic/OpenAI
a3 init . --overwrite   # replace existing files
```
Classify findings via agentic LLM investigation.
```
a3 triage --sarif results.sarif --provider github --agentic             # GitHub Models (recommended)
a3 triage --sarif results.sarif --output-sarif triaged.sarif --agentic  # Anthropic (default provider)
a3 triage --sarif results.sarif --provider openai --model gpt-5 --agentic
a3 triage --sarif results.sarif --verbose --agentic
```
The `--agentic` flag enables multi-turn tool-use: the LLM can read files, search for patterns, inspect callers, check tests, and explore the repo before classifying each finding. Without `--agentic`, a simpler one-shot prompt is used.

Providers: `github` (uses `GITHUB_TOKEN`), `anthropic` (uses `ANTHROPIC_API_KEY`), `openai` (uses `OPENAI_API_KEY`). Pass `--api-key` to override.
Manage the findings baseline (ratchet).
```
a3 baseline diff --sarif results.sarif               # check for new bugs
a3 baseline accept --sarif results.sarif             # update baseline
a3 baseline diff --sarif results.sarif --auto-issue  # file GitHub issues
```
DIV_ZERO · NULL_PTR · INDEX_OOB · KEY_ERROR · TYPE_ERROR · ASSERT_FAIL · UNBOUND_VAR · INTEGER_OVERFLOW · NON_TERMINATION · MEMORY_LEAK · USE_AFTER_FREE · DOUBLE_FREE · DATA_RACE · DEADLOCK · TIMING_CHANNEL · INFO_LEAK · BOUNDS · RUNTIME_ERROR · TYPE_CONFUSION · OVERFLOW
Injection: SQL_INJECTION · COMMAND_INJECTION · CODE_INJECTION · PATH_INJECTION · LDAP_INJECTION · XPATH_INJECTION · NOSQL_INJECTION · REGEX_INJECTION · HEADER_INJECTION · COOKIE_INJECTION
Web: REFLECTED_XSS · SSRF · PARTIAL_SSRF · URL_REDIRECT · CSRF_PROTECTION_DISABLED · FLASK_DEBUG · INSECURE_COOKIE · JINJA2_AUTOESCAPE_FALSE
Crypto: WEAK_CRYPTO · WEAK_CRYPTO_KEY · BROKEN_CRYPTO_ALGORITHM · INSECURE_PROTOCOL
Deserialization: UNSAFE_DESERIALIZATION · XXE · XML_BOMB
Secrets: CLEARTEXT_LOGGING · CLEARTEXT_STORAGE · HARDCODED_CREDENTIALS
Files/Network: TAR_SLIP · INSECURE_TEMPORARY_FILE · WEAK_FILE_PERMISSIONS · BIND_TO_ALL_INTERFACES · MISSING_HOST_KEY_VALIDATION · CERT_VALIDATION_DISABLED
Regex DoS: REDOS · POLYNOMIAL_REDOS · BAD_TAG_FILTER · INCOMPLETE_HOSTNAME_REGEXP
A3 runs a 7-step symbolic execution pipeline using formal methods — no AI, no heuristics:
1. **Call Graph** — Builds a whole-program call graph from all `.py` files
2. **Crash Summaries** — Disassembles bytecode to find divisions, None-dereferences, out-of-bounds accesses, taint flows, and 67 other bug patterns
3. **Symbolic Model Construction** — Builds Z3 symbolic representations of Python code objects
4. **Guard Detection** — Identifies bugs already protected by `if`, `try/except`, `assert`, `isinstance` checks using symbolic constraints
5. **Z3 Symbolic Execution** — Uses the Z3 SMT solver to prove whether bugs are reachable by constructing satisfying assignments
6. **Staged Portfolio** — Runs multiple proof strategies in parallel (barrier certificates, inductive invariants, k-induction)
7. **Classification** — Separates production code from test code
Each bug receives one of three verdicts:
| Verdict | Meaning |
|---|---|
| FP (proven) | Barrier certificate or DSE proves the bug is unreachable — 99% of findings |
| TP candidate | No proof found; send to LLM for verification — ~1% of findings |
| DSE-confirmed TP | Z3 found a satisfying input that triggers the crash — almost certainly a real bug |
For findings that survive static analysis, A3 launches an agentic investigation — a multi-turn conversation where the LLM has access to tools:
| Tool | What it does |
|---|---|
| `read_file` | Read any source file (with optional line range) |
| `search_codebase` | Grep for regex patterns across the project |
| `get_function_source` | Look up any function by name |
| `get_imports` | See what a file imports |
| `list_directory` | Explore project structure |
| `classify` | Submit final TP/FP verdict with confidence + rationale |
The agent typically makes 2–6 tool calls per finding — reading callers, checking for guard patterns, looking at tests, following imports — then calls classify with its verdict.
This is not a one-shot prompt. The LLM decides what to investigate, gathers evidence iteratively, and only classifies when it has enough context. This produces significantly more accurate results than a single-pass LLM call.
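A minimal sketch of such a tool-use loop, with a scripted stand-in for the LLM (the function names, message format, and dispatch shown here are illustrative, not A3's actual implementation in `agentic_triage.py`):

```python
# Hypothetical sketch of an agentic triage loop. The model repeatedly picks a
# tool to call; the loop ends only when it calls the `classify` tool.
def run_agentic_triage(llm, finding, tools, max_turns=10):
    """Let the model call tools until it submits a classify verdict."""
    messages = [{"role": "user", "content": f"Investigate: {finding}"}]
    for _ in range(max_turns):
        call = llm(messages)  # expected shape: {"tool": name, "args": {...}}
        if call["tool"] == "classify":
            return call["args"]  # {"verdict": "TP"/"FP", "rationale": ...}
        # Execute the requested tool and feed the result back to the model.
        result = tools[call["tool"]](**call["args"])
        messages.append({"role": "tool", "content": str(result)})
    # Budget exhausted: fail safe by keeping the finding as a TP candidate.
    return {"verdict": "TP", "rationale": "tool budget exhausted"}

# Scripted fake LLM: reads the flagged file, then classifies it as FP.
script = iter([
    {"tool": "read_file", "args": {"path": "utils.py"}},
    {"tool": "classify", "args": {"verdict": "FP", "rationale": "guarded"}},
])
tools = {"read_file": lambda path: f"<contents of {path}>"}
verdict = run_agentic_triage(lambda msgs: next(script), "DIV_ZERO at utils.py:42", tools)
print(verdict)  # → {'verdict': 'FP', 'rationale': 'guarded'}
```

The key property is that the model, not the harness, decides how many tool calls to make and in what order, which is what distinguishes this from a fixed one-shot prompt.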
```python
# examples.py
def authenticate_user(username, user_database):
    """Look up user credentials from database."""
    user_record = user_database.get(username)
    return user_record['password_hash']  # NULL_PTR: user_record could be None

def calculate_completion_rate(completed, total):
    """Calculate completion percentage."""
    return (completed / total) * 100  # DIV_ZERO: total could be 0

def get_database_host(config):
    """Extract database host from config."""
    return config.database.host  # NULL_PTR: config or database could be None

def get_latest_transaction(transactions):
    """Get the most recent transaction."""
    sorted_txns = sorted(transactions, key=lambda t: t.date, reverse=True)
    return sorted_txns[0].amount  # BOUNDS: sorted_txns could be empty

def extract_email_from_csv(csv_line):
    """Parse email from third column of CSV."""
    fields = csv_line.split(',')
    return fields[2].strip()  # BOUNDS: might not have 3 columns

def calculate_roi(profit, cost):
    """Calculate return on investment."""
    return (profit / cost) * 100  # DIV_ZERO: cost could be 0

def get_product_total_price(inventory, product_id):
    """Calculate total price including tax."""
    product = inventory.lookup(product_id)
    return product.base_price * (1 + product.tax_rate)  # NULL_PTR: product could be None
```

Run a3:
```
$ a3 scan examples.py --interprocedural

============================================================
INTERPROCEDURAL ANALYSIS RESULTS
============================================================
Total bugs found: 14

BOUNDS (3)
  - examples.get_first_user_email
      examples.py:41
      Confidence: 0.19
  - examples.get_latest_transaction
      examples.py:52
      Confidence: 0.19
  - examples.extract_email_from_csv
      examples.py:76
      Confidence: 0.19

DIV_ZERO (3)
  - examples.calculate_completion_rate
      examples.py:19
      Confidence: 0.21
  - examples.calculate_average_score
      examples.py:64
      Confidence: 0.21
  - examples.calculate_roi
      examples.py:87
      Confidence: 0.21

NULL_PTR (7)
  - examples.get_from_cache
      examples.py:110
      Confidence: 0.19
  - examples.get_database_host
      examples.py:31
      Confidence: 0.19
  - examples.authenticate_user
      examples.py:13
      Confidence: 0.19
  - examples.get_product_total_price
      examples.py:98
      Confidence: 0.19
  - examples.calculate_average_score
      examples.py:64
      Confidence: 0.19
  ... and 2 more

VALUE_ERROR (1)
  - examples.get_first_user_email
      examples.py:38
      Confidence: 0.84
```

A3 automatically recognizes when bugs are properly guarded:
```python
# examples_safe.py
def authenticate_user(username, user_database):
    """SAFE: Checks if user exists."""
    user_record = user_database.get(username)
    if user_record is not None:
        return user_record['password_hash']
    return None

def calculate_completion_rate(completed, total):
    """SAFE: Checks for zero before division."""
    if total != 0:
        return (completed / total) * 100
    return 0.0

def get_database_host(config):
    """SAFE: Validates nested attributes."""
    if config is not None and config.database is not None:
        return config.database.host
    return "localhost"

def get_latest_transaction(transactions):
    """SAFE: Checks if list is empty."""
    sorted_txns = sorted(transactions, key=lambda t: t.date, reverse=True)
    if len(sorted_txns) > 0:
        return sorted_txns[0].amount
    return 0.0

def extract_email_from_csv(csv_line):
    """SAFE: Validates column count with length guard."""
    fields = csv_line.split(',')
    if len(fields) >= 3:
        return fields[2].strip()  # SAFE: len(fields) >= 3 guards fields[2]
    return None
```

```
$ a3 scan examples_safe.py --interprocedural

============================================================
INTERPROCEDURAL ANALYSIS RESULTS
============================================================
Total bugs found: 0
```

All guards successfully detected by Z3 symbolic execution!
The key improvements:
- ✅ BOUNDS bugs: 0 (down from 3) — length guards like `len(fields) >= 3` now correctly protect `fields[2]`
- ✅ NULL_PTR bugs: 0 (down from 7) — None-checks and nonnull guards detected
- ✅ DIV_ZERO bugs: 0 (down from 3) — zero-checks detected
For production use with legacy codebases, enable LLM triage to filter remaining false positives in unguarded code:
```
a3 scan examples_safe.py --interprocedural --output-sarif results.sarif
a3 triage --sarif results.sarif --provider github --output-sarif filtered.sarif --agentic
```

With agentic triage, the LLM explores the codebase to verify each finding, achieving 99%+ accuracy.
```
# Full symbolic execution pipeline (kitchensink enabled by default)
a3 scan . --interprocedural --dse-verify --output-sarif results.sarif

# Agentic triage for remaining 1% after symbolic verification
a3 triage --sarif results.sarif --output-sarif triaged.sarif --provider github --agentic --verbose

# Baseline ratchet check
a3 baseline diff --sarif triaged.sarif --auto-issue
```

```
docker build -t a3 .
docker run --rm -v $(pwd)/my_project:/target a3 /target
docker run --rm -v $(pwd):/code a3 /code/myfile.py --functions
```

```
a3_python/
├── cli.py              # CLI with subcommands (scan, init, triage, baseline)
├── analyzer.py         # Core analysis engine
├── frontend/           # Python loading, bytecode compilation
├── cfg/                # Control-flow graph + call graph construction
├── semantics/          # Symbolic bytecode execution, crash summaries
├── z3model/            # Z3 value/heap modeling
├── unsafe/             # Bug type predicates (67 types)
├── contracts/          # External call modeling, taint sources/sinks
├── dse/                # Concolic execution (Z3-backed)
├── barriers/           # Barrier certificate synthesis (10 patterns)
└── ci/                 # CI integration
    ├── sarif.py            # SARIF 2.1.0 serializer
    ├── baseline.py         # Ratchet / baseline management
    ├── triage.py           # One-shot LLM classification
    ├── agentic_triage.py   # Multi-turn agentic triage with tool-use
    ├── config.py           # .a3.yml loader
    ├── init_cmd.py         # `a3 init` bootstrapper
    └── templates/          # GitHub Actions workflow YAMLs
```
```
git clone https://github.com/thehalleyyoung/a3-python.git
cd a3-python
pip install -e ".[dev,ci]"
pytest
pytest --cov=a3_python
```

See LICENSE file.