4 changes: 2 additions & 2 deletions docs/en/examples/deepseek-r1.md
@@ -177,7 +177,7 @@ The final `--sglang-server-concurrency` is a parameter specific to miles. It is
SGLANG_ARGS=(
--rollout-num-gpus-per-engine 64
--sglang-mem-fraction-static 0.7
--sglang-enable-ep-moe
----sglang-ep-size 64
critical

There appears to be a typo in this argument. It starts with four dashes (----) instead of the standard two (--), which will cause the command to fail when executed.

Suggested change
----sglang-ep-size 64
--sglang-ep-size 64
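
The failure mode flagged in the comment can be demonstrated with a minimal, hypothetical sketch. This uses standard Python `argparse` as a stand-in for the real CLI plumbing (miles's actual parser may differ): flags are matched literally, so the four-dash spelling is simply an unrecognized token.

```python
import argparse

# Hypothetical stand-in for the real CLI: register the correct two-dash flag.
parser = argparse.ArgumentParser()
parser.add_argument("--sglang-ep-size", type=int)

# The four-dash typo does not match the registered flag; the destination
# stays unset and both tokens are left over as unrecognized arguments.
args, unknown = parser.parse_known_args(["----sglang-ep-size", "64"])
print(args.sglang_ep_size, unknown)  # None ['----sglang-ep-size', '64']

# The two-dash spelling parses as intended.
args, _ = parser.parse_known_args(["--sglang-ep-size", "64"])
print(args.sglang_ep_size)  # 64
```

A strict parser (`parse_args` rather than `parse_known_args`) would abort with an "unrecognized arguments" error, which is why the command fails outright.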


# dp attention
--sglang-enable-dp-attention
@@ -186,7 +186,7 @@ SGLANG_ARGS=(
--sglang-enable-dp-lm-head

# enable deepep for sglang
--sglang-enable-deepep-moe
--sglang-moe-a2a-backend deepep
--sglang-deepep-mode auto

# make every dp rank have 128 concurrency
4 changes: 2 additions & 2 deletions docs/en/examples/qwen3-30B-A3B.md
@@ -62,7 +62,7 @@ Here, we will briefly introduce the MoE-related parts in the [run-qwen3-30B-A3B.
SGLANG_ARGS=(
--rollout-num-gpus-per-engine 8
--sglang-mem-fraction-static 0.7
--sglang-enable-ep-moe
--sglang-ep-size 8
--sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256)
)
```
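
As an aside, `$(seq 16 8 256)` in the block above expands to `16 24 32 … 256`, so together with the explicit `1 2 4 8` the CUDA-graph batch-size list covers 1 through 256. In Python terms (used here only to spell out the arithmetic):

```python
# "$(seq 16 8 256)" produces 16, 24, ..., 256 (step 8, endpoint inclusive);
# prepending the explicit 1 2 4 8 gives the full CUDA-graph batch-size list.
cuda_graph_bs = [1, 2, 4, 8] + list(range(16, 257, 8))

print(cuda_graph_bs[:6])   # [1, 2, 4, 8, 16, 24]
print(cuda_graph_bs[-1])   # 256
print(len(cuda_graph_bs))  # 35
```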
@@ -109,7 +109,7 @@ In addition, you can make the following changes:
SGLANG_ARGS=(
--rollout-num-gpus-per-engine 24
--sglang-mem-fraction-static 0.7
--sglang-enable-ep-moe
--sglang-ep-size 24
--sglang-enable-dp-attention
--sglang-dp-size 3

66 changes: 3 additions & 63 deletions docs/en/get_started/usage.md
@@ -6,7 +6,7 @@
When using miles, parameters are primarily passed for the following purposes:

1. To allocate a portion of the GPUs in the cluster for training and another portion for inference.
2. To load Megatron or FSDP for the training portion.
2. To load Megatron for the training portion.
3. To load SGLang for the inference portion.
4. To configure the hyperparameters required for RL training.

@@ -35,7 +35,7 @@ Additionally, miles supports Prefill and Decode disaggregation (PD Disaggregatio
miles supports multiple training backends, which can be selected via the `--train-backend` parameter:

- `megatron` (default): Uses Megatron-LM as the training backend, supporting efficient training of large-scale models.
- `fsdp`: Uses PyTorch FSDP as the training backend, allowing direct loading of HuggingFace format weights without conversion.
- `fsdp` (experimental): Uses PyTorch FSDP as the training backend, allowing direct loading of HuggingFace format weights without conversion.

### Loading Megatron

@@ -280,7 +280,7 @@ miles incorporates almost all SGLang parameters by using SGLang's `ServerArgs.ad

- In co-located training and inference, you often need to limit `--mem-fraction-static`. This parameter should be changed to `--sglang-mem-fraction-static`.
- During training, if you want SGLang to infer beyond the maximum context length specified in the Hugging Face checkpoint's `config.json`, you need to use `--context-length`, which becomes `--sglang-context-length` in miles.
- For multi-node large EP inference, you might need `--enable-ep-moe`, `--enable-dp-attention`, `--dp-size`, `--enable-deepep-moe`, etc. These can be passed as `--sglang-enable-ep-moe`, `--sglang-enable-dp-attention`, `--sglang-dp-size`, and `--sglang-enable-deepep-moe` respectively.
- For multi-node large EP inference, you might need `--ep-size`, `--enable-dp-attention`, `--dp-size`, `--moe-a2a-backend deepep`, etc. These can be passed as `--sglang-ep-size`, `--sglang-enable-dp-attention`, `--sglang-dp-size`, and `--sglang-moe-a2a-backend deepep` respectively.
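
The prefix convention described above amounts to a mechanical rename. A hypothetical sketch of the mapping (the function name and logic are illustrative, not miles's real implementation):

```python
SGLANG_PREFIX = "--sglang-"

def to_native_sglang_flags(argv):
    """Illustrative sketch: a miles-style '--sglang-*' flag maps to the
    corresponding native SGLang flag by dropping the 'sglang-' part;
    flag values and other tokens pass through unchanged."""
    return [
        "--" + arg[len(SGLANG_PREFIX):] if arg.startswith(SGLANG_PREFIX) else arg
        for arg in argv
    ]

print(to_native_sglang_flags(
    ["--sglang-mem-fraction-static", "0.7", "--sglang-ep-size", "64"]
))
# ['--mem-fraction-static', '0.7', '--ep-size', '64']
```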

Some parameters related to miles's resource scheduling are configured by miles itself, for example:

@@ -322,63 +322,3 @@ In some customized Megatron implementations, special operations need to be perfo
- `--custom-megatron-init-path`: Adds some initialization calls.
- `--custom-megatron-before-log-prob-hook-path`: Is called before calculating the log probability.
- `--custom-megatron-before-train-step-hook-path`: Is called before each training step. You could use this to mix in special training losses, for example.

## How to Use FSDP

miles also supports FSDP2 as the training backend; docs [here](https://lmsys.org/blog/2025-12-03-miles-fsdp/).

> FSDP automatically reads all architecture information via `AutoModelForCausalLM.from_pretrained()`, without manual specification. Megatron requires manual configuration of parameters to read model architecture information. FSDP can read entirely from `config.json`, directly avoiding the weight format conversion step.

To run FSDP as the training backend, pass `--train-backend fsdp` to enable.

### Parameters

The parameters that FSDP uses are shown below in comparison to Megatron; more support is on the way.

| Configuration Category | Megatron Parameter | FSDP Parameter | Description |
| --- | --- | --- | --- |
| **Model Loading** | `--load` (Megatron checkpoint) + architecture args (`--num-layers`, `--hidden-size` etc.) | `--hf-checkpoint` (Required) | **FSDP**: Directly uses HuggingFace format, no weight conversion needed, architecture inferred via `AutoConfig` |
| **Tensor Parallel** | `--tensor-model-parallel-size` | Coming Soon | |
| **Pipeline Parallel** | `--pipeline-model-parallel-size` | Coming Soon | |
| **Expert Parallel** | `--expert-model-parallel-size` | Coming Soon | |
| **Context Parallel** | `--context-parallel-size` | `--context-parallel-size` | Both support CP |
| **Initial Learning Rate** | `--lr` | `--lr` | Same parameter |
| **Learning Rate Decay** | `--lr-decay-style` (linear/cosine etc.) | `--lr-decay-style` | Same parameter |
| **Warmup** | `--lr-warmup-iters` (steps) | `--lr-warmup-iters` | Same parameter |
| **Min Learning Rate** | `--min-lr` | `--min-lr` | Same parameter |
| **Optimizer Type** | `--optimizer` (adam/sgd etc.) | `--optimizer` (default adam) | Basically same |
| **Distributed Optimizer** | `--use-distributed-optimizer` | Built-in to FSDP | FSDP uses distributed optimizer by default |
| **Gradient Checkpoint** | `--recompute-granularity`, `--recompute-method` | `--gradient-checkpointing` | **FSDP**: Simplified to boolean switch |
| **CPU Offload** | Implemented via distributed optimizer | `--fsdp-cpu-offload` | **FSDP**: Offload parameters/gradients/optimizer states to CPU |
| **CPU Backend** | Implemented via distributed optimizer | `--fsdp-cpu-backend` | **FSDP**: Specify CPU backend and use hybrid backend when CPU offload is enabled |
| **Attention Backend** | Decided by Megatron Core | `--attn-implementation` (flash_attention_2/sdpa/eager) | **FSDP**: Directly passed to HuggingFace |
| **Mixed Precision** | `--fp16` or `--bf16` | `--fp16` (bf16 inferred automatically) | Basically same |
| **Training Backend** | Default or `--train-backend megatron` | `--train-backend fsdp` (Required) | Used to switch backend |
| **Config** | | `--config` | **FSDP**: Set additional parameters for FSDP backend |

### Quick Start

```bash
# If you need to use WANDB, you need to set the environment variable WANDB_API_KEY in advance
# Download model weights (Qwen3-4B)
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B

# Download training dataset (dapo-math-17k)
hf download --repo-type dataset zhuzilin/dapo-math-17k \
--local-dir /root/dapo-math-17k

# Download evaluation dataset (aime-2024)
hf download --repo-type dataset zhuzilin/aime-2024 \
--local-dir /root/aime-2024

# Clone code and install dependencies
git clone https://github.com/radixark/miles.git
cd miles
pip install -e . --no-deps


# FSDP does not require weight conversion, natively supports huggingface format
# Enable reference model, train Qwen3-4B in colocate mode
source /root/miles/scripts/run-qwen3-4B-fsdp.sh
```

4 changes: 2 additions & 2 deletions scripts/run-deepseek-r1.sh
@@ -113,7 +113,7 @@ WANDB_ARGS=(
SGLANG_ARGS=(
--rollout-num-gpus-per-engine 64
--sglang-mem-fraction-static 0.7
--sglang-enable-ep-moe
--sglang-ep-size 64

# dp attention
--sglang-enable-dp-attention
@@ -122,7 +122,7 @@ SGLANG_ARGS=(
--sglang-enable-dp-lm-head

# enable deepep for sglang
--sglang-enable-deepep-moe
--sglang-moe-a2a-backend deepep
--sglang-deepep-mode auto

# make every dp rank has 128 concurrency
4 changes: 2 additions & 2 deletions scripts/run-kimi-k2-Instruct.sh
@@ -128,8 +128,8 @@ SGLANG_ARGS=(
--sglang-ep-size 16

# enable deepep for sglang
# --sglang-enable-deepep-moe
# --sglang-deepep-mode auto
# --sglang-moe-a2a-backend deepep
# --sglang-deepep-mode auto

# make every dp rank has 128 concurrency
--sglang-server-concurrency 1024
2 changes: 1 addition & 1 deletion scripts/run-kimi-k2-Thinking.sh
@@ -130,7 +130,7 @@ SGLANG_ARGS=(
--sglang-ep-size 16

# enable deepep for sglang
# --sglang-enable-deepep-moe
# --sglang-moe-a2a-backend deepep
# --sglang-deepep-mode auto

# make every dp rank has 128 concurrency
1 change: 0 additions & 1 deletion scripts/run-qwen3-32B.sh
@@ -110,7 +110,6 @@ SGLANG_ARGS=(
--rollout-num-gpus-per-engine 8
--sglang-mem-fraction-static 0.7
--sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256)
# --sglang-enable-ep-moe
)

MISC_ARGS=(