412 changes: 412 additions & 0 deletions .claude/skills/msgspec-patterns/SKILL.md

Large diffs are not rendered by default.

120 changes: 120 additions & 0 deletions .claude/skills/msgspec-struct-gc-check/SKILL.md
@@ -0,0 +1,120 @@
---
name: msgspec-struct-gc-check
description: Check whether msgspec.Struct types can safely use gc=False. Use when adding or changing msgspec.Struct definitions, or when reviewing code that uses msgspec structs.
allowed-tools: Read, Grep, Glob
---

# msgspec.Struct gc=False Safety Check

## When to use this skill

- Adding or modifying a class that inherits from `msgspec.Struct`
- Reviewing or refactoring code that defines or uses msgspec structs
- Deciding whether to add or remove `gc=False` on a Struct

## Why gc=False matters

Setting `gc=False` on a Struct means instances are **never tracked** by Python's garbage collector. This reduces GC pressure and can improve performance when many structs are allocated. The **main** risk: the collector cannot traverse an untracked struct, so any **reference cycle** that passes through a `gc=False` struct will **never be collected** (memory leak), even if the other objects in the cycle are tracked.

Reference: [msgspec Structs – Disabling Garbage Collection](https://jcristharif.com/msgspec/structs.html#struct-gc).

## Verified safety constraints

Use these constraints to decide if a Struct can use `gc=False`. All must hold.

### 1. No reference cycles

- The struct (and any container it references) must never be part of a reference cycle.
- **Multiple variables** pointing to the same struct (`x = s; y = x`) are **safe** — that is not a cycle. A cycle is A → B → … → A.
- **Returning** a struct from a function is **safe**. What matters is whether any reference path leads back to the struct (e.g. struct's list contains the struct or something that holds the struct).
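
The difference between aliasing and a true cycle can be checked at runtime. The following is a minimal sketch, not part of msgspec: `Node` is a plain Python class standing in for a struct, and `reaches_itself` is a hypothetical helper that walks `gc.get_referents` looking for a reference path back to the starting object.

```python
import gc
import types

# Classes, functions, and modules are skipped: nearly every object reaches
# module globals through its class, which would otherwise look like a cycle.
SKIPPED = (type, types.ModuleType, types.FunctionType)


class Node:
    """Plain stand-in for a struct with one container field (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.children = []


def reaches_itself(obj):
    """Return True if some reference path starting at obj leads back to obj."""
    seen = set()
    stack = list(gc.get_referents(obj))
    while stack:
        current = stack.pop()
        if current is obj:
            return True
        if isinstance(current, SKIPPED) or id(current) in seen:
            continue
        seen.add(id(current))
        stack.extend(gc.get_referents(current))
    return False


a = Node("a")
alias = a                     # a second name for the same object: not a cycle
print(reaches_itself(a))      # False

a.children.append(a)          # a -> children list -> a: a real cycle
print(reaches_itself(a))      # True
```

A check like this is only a debugging aid; the constraints in this section are meant to be established by reading the code, not by sampling objects at runtime.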

### 2. No mutation that could create cycles

- **Do not mutate** struct fields after construction in a way that could introduce a cycle (e.g. set a field to an object that references the struct, or append the struct to its own list/dict).
- **Frozen structs** (`frozen=True`) prevent field reassignment; `force_setattr` in `__post_init__` is one-time init only, so that's acceptable.
- Assigning **scalars** (int, str, bool, float, None) to fields is safe — they cannot form cycles.

### 3. Mutable containers (list, dict, set) on the struct

- If the struct has list/dict/set fields, either:
  - **Never mutate** those containers after creation (no `.append`, `.update`, `[...] = ...`, etc.), and never store in them any object that references the struct, or
  - Do not use `gc=False` (conservative).
- **Reading** from containers (e.g. `x = struct.foobars[i]`) does not create cycles and is allowed.

### 4. Nested structs

- If a struct holds another Struct (or holds containers that hold Structs), the same rules apply to the whole reference graph: no cycles, no mutation that could create cycles. If any nested Struct uses `gc=False`, the whole graph must still be cycle-free.

### 5. Generic / mixins

- With `gc=False`, the type must be compatible with `__slots__` (e.g. if using `Generic`, the mixin must define `__slots__ = ()`). See msgspec issue #631 / PR #635.

## Checklist for "can use gc=False"

- [ ] Struct and everything it references can never participate in a reference cycle.
- [ ] No mutation of struct fields after construction that could introduce a cycle (frozen or init-only mutation is ok; scalar assignment is ok).
- [ ] Any list/dict/set fields are never mutated after creation, or we do not use gc=False.
- [ ] No storing the struct (or anything that references it) inside its own container fields.
- [ ] If Generic/mixins are used, `__slots__` compatibility is satisfied.

## Checklist for "must NOT use gc=False"

- [ ] Struct is mutated after creation in a way that could create a cycle (e.g. appending self to a list field).
- [ ] Container fields are mutated after creation and could hold the struct or back-references.
- [ ] Struct is used in a pattern where it's stored in a container that the struct (or its fields) also references.

## Quick per-struct analysis steps

1. List all fields and their types (scalars vs containers vs nested Structs).
2. Search the codebase for: assignments to this struct's fields, mutations of its container fields (`.append`, `.update`, etc.), and any place the struct instance is stored (e.g. in a list/dict that might be referenced by the struct).
3. If only scalars or immutable types, or frozen with no container mutation → likely safe for gc=False.
4. If mutable containers and they're never mutated (and never made to reference the struct) → likely safe; otherwise → do not use gc=False.
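
Step 1 can be partly automated. The sketch below is an assumption-laden illustration: `classify_fields` is a hypothetical helper, and `QueryResultLike` is an annotation-only stand-in for a struct definition. It sorts annotated fields into scalars, mutable containers, and everything else, so that the "everything else" bucket (nested structs, unions, tuples) can be reviewed by hand.

```python
from typing import get_origin, get_type_hints

SCALARS = (int, float, str, bool, bytes, type(None))
MUTABLE_CONTAINERS = (list, dict, set)


def classify_fields(cls):
    """Map each annotated field of cls to a coarse category."""
    result = {}
    for name, hint in get_type_hints(cls).items():
        origin = get_origin(hint) or hint  # list[str] -> list, int -> int
        if origin in SCALARS:
            result[name] = "scalar"
        elif origin in MUTABLE_CONTAINERS:
            result[name] = "mutable-container"
        else:
            result[name] = "other"  # nested structs, unions, tuples, ...
    return result


class QueryResultLike:
    """Annotation-only stand-in for a struct definition (hypothetical)."""

    id: str
    latency_ms: float
    metadata: dict[str, str]
    chunks: list[str]
    tags: tuple[str, ...]


for field, kind in classify_fields(QueryResultLike).items():
    print(f"{field}: {kind}")
```

Fields reported as `mutable-container` are the ones that need the codebase search in step 2.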

## Risky structs: audit and at-risk comment

A struct is **risky** for gc=False if it has a condition that would normally disallow gc=False (e.g. mutable list/dict/set fields), but that condition might never arise in practice (e.g. the field is only ever read, never mutated after construction).

### Auditing a risky struct

1. Identify the at-risk condition (e.g. "has `metadata: dict` that could be mutated").
2. Search the codebase for all uses of that struct and of the at-risk field:
   - Any assignment to the field: `obj.field = ...`, `obj.field[key] = ...`, `obj.field.append(...)`, `obj.field.update(...)`, etc.
   - Any code path that could store the struct (or something holding it) inside that container.
3. If the audit finds **no** such mutation or cycle-creating storage, the condition never arises and gc=False is acceptable **provided** you add the at-risk marker so future changes are re-audited.
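
The search in step 2 can be approximated with a regular expression. This is a heuristic sketch only: `find_field_mutations` is a hypothetical helper, the list of mutating method names is an assumption, and it will miss mutations made through an alias (for example `m = obj.metadata; m["k"] = 1`), so it supplements a manual read rather than replacing it.

```python
import re


def find_field_mutations(source, field):
    """Return (line_number, line) pairs that appear to mutate `.<field>`."""
    name = re.escape(field)
    pattern = re.compile(
        rf"\.{name}\s*(?:"
        r"[+|]?=(?!=)"                 # obj.field = ..., obj.field += ...
        r"|\[[^\]]*\]\s*=(?!=)"        # obj.field[key] = ...
        r"|\.(?:append|extend|insert|update|add|pop|remove|clear|setdefault)\s*\("
        r")"
    )
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if pattern.search(line):
            hits.append((number, line.strip()))
    return hits


sample = """\
result.metadata["k"] = 1
value = result.metadata["k"]
result.metadata.update(extra)
if result.metadata == {}:
    pass
"""

for number, line in find_field_mutations(sample, "metadata"):
    print(number, line)
```

In the sample, only the subscript assignment and the `.update(...)` call are reported; the read and the equality comparison are not.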

### When audit passes

- Set `gc=False` on the struct.
- Add an **at-risk comment** and docstring note:

  - **Above the class**: a short comment stating why gc=False is used despite the at-risk condition, and when the audit was done (e.g. `# gc=False: audit YYYY-MM: <condition> is only read, never mutated.`).
  - **In the docstring**: a line that signals to future readers and to this skill that changes touching this struct must be re-audited. Use this format:

    `AT-RISK (gc=False): Has <brief condition>. Any change that <what would violate safety> must be audited; if so, remove gc=False.`

  - Example (for a struct with a `metadata` dict that is only ever read):

    ```python
    # gc=False: audit 2026-03: metadata dict is only ever read, never mutated after construction.
    class QueryResult(msgspec.Struct, ..., gc=False):
        """Result of a completed inference query.

        AT-RISK (gc=False): Has mutable container field `metadata`. Any change that
        mutates `metadata` after construction or stores this struct in a container
        referenced by this struct must be audited; if so, remove gc=False.
        ...
        """
    ```

### When touching an at-risk struct

If you are adding or changing code that uses a struct marked AT-RISK (gc=False):

1. Re-run the audit for that struct (searches above).
2. If your change mutates the at-risk field(s) or creates a cycle (e.g. stores the struct in its own container), **remove** `gc=False` from the struct and remove the at-risk comment/docstring line.
3. If your change does not touch the at-risk field or create cycles, the existing gc=False and at-risk comment remain; you may add a short note in the at-risk comment if the audit was re-checked (e.g. update the audit date).
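
To find the structs this section applies to, the class definitions can be scanned statically. The sketch below is one possible approach, not an existing tool: `gc_false_classes` is a hypothetical helper that uses the standard `ast` module to list every class declared with `gc=False` and report whether its docstring carries the `AT-RISK (gc=False)` marker. The source is only parsed, never imported.

```python
import ast

MARKER = "AT-RISK (gc=False)"


def gc_false_classes(source):
    """Yield (class_name, has_marker) for every class defined with gc=False."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        uses_gc_false = any(
            kw.arg == "gc"
            and isinstance(kw.value, ast.Constant)
            and kw.value.value is False
            for kw in node.keywords
        )
        if uses_gc_false:
            docstring = ast.get_docstring(node) or ""
            yield node.name, MARKER in docstring


sample = '''
import msgspec

class Marked(msgspec.Struct, gc=False):
    """Result.

    AT-RISK (gc=False): Has mutable container field `metadata`.
    """
    metadata: dict

class Unmarked(msgspec.Struct, gc=False):
    value: int

class Tracked(msgspec.Struct):
    items: list
'''

for name, has_marker in gc_false_classes(sample):
    print(name, "marked" if has_marker else "unmarked")
```

An unmarked `gc=False` class is not necessarily a problem (a struct with only scalar fields needs no marker); the listing just tells a reviewer which definitions to look at.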

## References

- [msgspec Structs – Disabling Garbage Collection](https://jcristharif.com/msgspec/structs.html#struct-gc)
- [msgspec Performance Tips – Use gc=False](https://jcristharif.com/msgspec/perf-tips.html#use-gc-false)
- [msgspec #631 – Generic structs and gc=False](https://github.com/jcrist/msgspec/issues/631)
7 changes: 6 additions & 1 deletion .gitignore
@@ -189,5 +189,10 @@ outputs/
# Example vLLM virtualenv
examples/03_BenchmarkComparison/vllm_venv/

# Cursor artifacts (local development only)
# Agent artifacts (local development only)
.cursor_artifacts/
.claude/agent-memory/

# User-specific local rules (local Docker dev); do not commit
.cursor/rules/local-docker-dev.mdc
CLAUDE.local.md
22 changes: 10 additions & 12 deletions AGENTS.md
@@ -73,7 +73,7 @@ CLI is auto-generated from `config/schema.py` Pydantic models via cyclopts. Fiel

- **CLI mode** (`offline`/`online`): cyclopts constructs `OfflineBenchmarkConfig`/`OnlineBenchmarkConfig` (subclasses in `config/schema.py`) directly from CLI args. Type locked via `Literal`. `--dataset` is repeatable with TOML-style format `[perf|acc:]<path>[,key=value...]` (e.g. `--dataset data.csv,samples=500,parser.prompt=article`). Full accuracy support via `accuracy_config.eval_method=pass_at_1` etc.
- **YAML mode** (`from-config`): `BenchmarkConfig.from_yaml_file()` loads YAML, resolves env vars, and auto-selects the right subclass via Pydantic discriminated union. Optional `--timeout`/`--mode` overrides via `config.with_updates()`.
- **eval**: Not yet implemented (raises `NotImplementedError`)
- **eval**: Not yet implemented (raises `CLIError` with a tracking issue link)

### Config Construction & Validation

@@ -137,7 +137,11 @@ src/inference_endpoint/
│ └── utils.py # Port range helpers
├── async_utils/
│ ├── loop_manager.py # LoopManager (uvloop + eager_task_factory)
│ ├── runner.py # run_async() — uvloop + eager_task_factory entry point for CLI commands
│ ├── event_publisher.py # Async event pub/sub
│ ├── services/
│ │ ├── event_logger/ # EventLoggerService: writes EventRecords to JSONL/SQLite
│ │ └── metrics_aggregator/ # MetricsAggregatorService: real-time metrics (TTFT, TPOT, ISL, OSL)
│ └── transport/ # ZMQ-based IPC transport layer
│ ├── protocol.py # Transport protocols + TransportConfig base
│ ├── record.py # Transport records
@@ -192,26 +196,20 @@ tests/

## Development Standards

### Code Style
### Code Style and Pre-commit Hooks

- **Formatter/Linter**: `ruff` (line-length 88, target Python 3.12)
- **Type checking**: `mypy` (via pre-commit)
- **Formatting**: `ruff-format` (double quotes, space indent)
- **License headers**: Required on all Python files (enforced by pre-commit hook `scripts/add_license_header.py`)
- **Conventional commits**: `feat:`, `fix:`, `docs:`, `test:`, `chore:`

### Pre-commit Hooks

All of these run automatically on commit:

- trailing-whitespace, end-of-file-fixer, check-yaml, check-merge-conflict, debug-statements
- `ruff` (lint + autofix) and `ruff-format`
- `mypy` type checking
- `prettier` for YAML/JSON/Markdown
- License header enforcement
All of these hooks run automatically on commit: trailing-whitespace, end-of-file-fixer, check-yaml, check-merge-conflict, debug-statements, `ruff` (lint + autofix), `ruff-format`, `mypy`, `prettier` (YAML/JSON/Markdown), license header enforcement.

**Always run `pre-commit run --all-files` before committing.**

See [Development Guide](docs/DEVELOPMENT.md) for full setup and workflow details.

### Data Types & Serialization

- **Core types** (`Query`, `QueryResult`, `StreamChunk`): `msgspec.Struct` with `frozen=True`, `array_like=True`, `gc=False`, `omit_defaults=True`
@@ -291,7 +289,7 @@ Update AGENTS.md as part of any PR that includes a **significant refactor**, mea
- **Added or removed CLI commands/subcommands** — update CLI Modes and Common Commands
- **Changed test infrastructure** (new fixtures, changed markers, new test directories) — update Testing section
- **Added or removed key dependencies** — update Key Dependencies table
- **Changed build/tooling** (new pre-commit hooks, changed ruff config, new CI steps) — update Code Style and Pre-commit Hooks
- **Changed build/tooling** (new pre-commit hooks, changed ruff config, new CI steps) — update [docs/DEVELOPMENT.md](docs/DEVELOPMENT.md)
- **Changed hot-path patterns** (new transport, changed serialization, new performance constraints) — update Performance Guidelines

### How to Update
2 changes: 2 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,6 @@

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Full guidance is maintained in AGENTS.md (shared with all AI coding agents) and is included below:

@AGENTS.md
2 changes: 2 additions & 0 deletions CONTRIBUTING.md
@@ -7,3 +7,5 @@ Generally we encourage people to become MLCommons members if they wish to contri
Regardless of whether you are a member, your organization (or you as an individual contributor) needs to sign the MLCommons Contributor License Agreement (CLA). Please submit your GitHub username to the [MLCommons Subscription form](https://mlcommons.org/community/subscribe/) to start that process.

MLCommons project work is tracked with issue trackers and pull requests. Modify the project in your own fork and issue a pull request once you want other developers to take a look at what you have done and discuss the proposed changes. Ensure that cla-bot and other checks pass for your pull requests.

For project-specific development standards (code style, test requirements, pre-commit hooks, commit format), see the [Development Guide](docs/DEVELOPMENT.md).
58 changes: 38 additions & 20 deletions README.md
@@ -66,7 +66,7 @@ inference-endpoint benchmark offline \

```bash
# Start local echo server
python -m inference_endpoint.testing.echo_server --port 8765 &
python3 -m inference_endpoint.testing.echo_server --port 8765 &

# Test with dummy dataset (included in repo)
inference-endpoint benchmark offline \
@@ -94,33 +94,51 @@ pytest -m "not performance and not run_explicitly"

## 📚 Documentation

- [AGENTS.md](AGENTS.md) - Architecture, conventions, and AI agent guidelines
- [CLI Quick Reference](docs/CLI_QUICK_REFERENCE.md) - Command-line interface guide
- [Local Testing Guide](docs/LOCAL_TESTING.md) - Test with echo server
- [Development Guide](docs/DEVELOPMENT.md) - How to contribute and develop
- [Performance Architecture](docs/PERF_ARCHITECTURE.md) - Hot-path design and tuning
- [Performance Tuning](docs/CLIENT_PERFORMANCE_TUNING.md) - CPU affinity and client tuning
- [GitHub Setup Guide](docs/GITHUB_SETUP.md) - GitHub authentication and setup

### Component Design Specs

Each top-level component under `src/inference_endpoint/` has a corresponding spec:

| Component | Spec |
| ----------------- | ---------------------------------------------------------------- |
| Core types | [docs/core/DESIGN.md](docs/core/DESIGN.md) |
| Load generator | [docs/load_generator/DESIGN.md](docs/load_generator/DESIGN.md) |
| Endpoint client | [docs/endpoint_client/DESIGN.md](docs/endpoint_client/DESIGN.md) |
| Metrics | [docs/metrics/DESIGN.md](docs/metrics/DESIGN.md) |
| Config | [docs/config/DESIGN.md](docs/config/DESIGN.md) |
| Async utils | [docs/async_utils/DESIGN.md](docs/async_utils/DESIGN.md) |
| Dataset manager | [docs/dataset_manager/DESIGN.md](docs/dataset_manager/DESIGN.md) |
| Commands (CLI) | [docs/commands/DESIGN.md](docs/commands/DESIGN.md) |
| OpenAI adapter | [docs/openai/DESIGN.md](docs/openai/DESIGN.md) |
| SGLang adapter | [docs/sglang/DESIGN.md](docs/sglang/DESIGN.md) |
| Evaluation | [docs/evaluation/DESIGN.md](docs/evaluation/DESIGN.md) |
| Testing utilities | [docs/testing/DESIGN.md](docs/testing/DESIGN.md) |
| Profiling | [docs/profiling/DESIGN.md](docs/profiling/DESIGN.md) |
| Plugins | [docs/plugins/DESIGN.md](docs/plugins/DESIGN.md) |
| Utils | [docs/utils/DESIGN.md](docs/utils/DESIGN.md) |

## 🎯 Architecture

The system follows a modular, event-driven architecture:

```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Dataset │ │ Load │ │ Endpoint │
│ Manager │───▶│ Generator │───▶│ Client │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Metrics │ │ Configuration │ │ Endpoint │
│ Collector │◄───│ Manager │ │ (External) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Dataset Manager ──► Load Generator ──► Endpoint Client ──► External Endpoint
Metrics Collector
(event logging + reporting)
```

- **Load Generator**: Central orchestrator managing query lifecycle
- **Dataset Manager**: Handles benchmark datasets and preprocessing
- **Endpoint Client**: Abstract interface for endpoint communication
- **Metrics Collector**: Performance measurement and analysis
- **Configuration Manager**: System configuration (TBD)
- **Dataset Manager**: Loads benchmark datasets and applies transform pipelines
- **Load Generator**: Central orchestrator — controls timing (scheduler), issues queries, and emits sample events
- **Endpoint Client**: Multi-process HTTP worker pool communicating over ZMQ IPC
- **Metrics Collector**: Receives sample events from Load Generator; writes to SQLite (EventRecorder), aggregates after the run (MetricsReporter)

## Accuracy Evaluation

@@ -132,14 +150,13 @@ configuration. Currently, Inference Endpoints provides the following pre-defined
- LiveCodeBench (default: lite, release_v6)

However, LiveCodeBench will not work out-of-the-box and requires some additional setup. See the
[LiveCodeBench](src/inference_endpoint/dataset_manager/predefined/livecodebench/README.md) documentation
for details and explanations.
[LiveCodeBench](src/inference_endpoint/evaluation/livecodebench/README.md) documentation for
details and explanations.

## 🚧 Pending Features

The following features are planned for future releases:

- [ ] **Performance Tuning** - Advanced performance optimization features
- [ ] **Submission Ruleset Integration** - Full MLPerf submission workflow support
- [ ] **Documentation Generation and Hosting** - Sphinx-based API documentation with GitHub Pages

@@ -166,7 +183,8 @@ We are grateful to these communities for their contributions to LLM benchmarking

## 📄 License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE.md) file for
details.

## 🔗 Links

4 changes: 3 additions & 1 deletion docs/CLI_DESIGN.md
@@ -172,9 +172,11 @@ InputValidationError 2 Bad user input, invalid config
SetupError 3 Dataset load failure, connection error
ExecutionError 4 Benchmark failed after setup
CLIError 1 Generic CLI error (base class)
NotImplementedError 1 Unimplemented command (eval)
```

The reserved `eval` command currently raises `CLIError` with a tracking issue link rather than a
dedicated exception type.
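
The exit-code table above can be pictured as a small exception hierarchy. This is a sketch under stated assumptions, not the project's actual code: the `exit_code` class attribute, the `run_cli` wrapper, and the `eval_command` body are hypothetical; only the class names, their codes, and `CLIError` being the base class come from the table.

```python
import sys


class CLIError(Exception):
    """Generic CLI error (base class)."""

    exit_code = 1  # assumed attribute name, for illustration only


class InputValidationError(CLIError):
    exit_code = 2  # bad user input, invalid config


class SetupError(CLIError):
    exit_code = 3  # dataset load failure, connection error


class ExecutionError(CLIError):
    exit_code = 4  # benchmark failed after setup


def run_cli(command):
    """Run command() and translate CLIError subclasses into exit codes."""
    try:
        command()
    except CLIError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return exc.exit_code
    return 0


def eval_command():
    # Reserved command: raises the base CLIError (exit code 1) rather than
    # a dedicated exception type. The link text is a placeholder.
    raise CLIError("eval is not yet implemented; see <tracking issue link>")


print(run_cli(eval_command))
```

Because the reserved `eval` command raises the base class, it exits with the generic code 1 without needing its own row in the table.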

## Development Guide

### Adding a CLI flag