
Commit d85c47d (parent: ef725fe)

Replace "online inference" with "online serving" (#11923)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

File tree: 11 files changed (+16, -16 lines)

.buildkite/run-cpu-test.sh

Lines changed: 1 addition & 1 deletion
@@ -61,7 +61,7 @@ function cpu_tests() {
     pytest -s -v -k cpu_model \
       tests/basic_correctness/test_chunked_prefill.py"
 
-  # online inference
+  # online serving
   docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
     set -e
     export VLLM_CPU_KVCACHE_SPACE=10

docs/source/features/structured_outputs.md

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@
 vLLM supports the generation of structured outputs using [outlines](https://github.com/dottxt-ai/outlines), [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer), or [xgrammar](https://github.com/mlc-ai/xgrammar) as backends for the guided decoding.
 This document shows you some examples of the different options that are available to generate structured outputs.
 
-## Online Inference (OpenAI API)
+## Online Serving (OpenAI API)
 
 You can generate structured outputs using the OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) API.
 
@@ -239,7 +239,7 @@ The main available options inside `GuidedDecodingParams` are:
 - `backend`
 - `whitespace_pattern`
 
-These parameters can be used in the same way as the parameters from the Online Inference examples above.
+These parameters can be used in the same way as the parameters from the Online Serving examples above.
 One example for the usage of the `choices` parameter is shown below:
 
 ```python
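
For context on the renamed section, here is a minimal sketch of the kind of online-serving request the structured outputs page describes. It is not part of this commit; it assumes an OpenAI-compatible vLLM server already running on localhost:8000 and uses a placeholder model name.

```python
# Illustrative sketch (not part of this commit): request a structured output
# from a vLLM OpenAI-compatible server via the guided_choice extra field.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # placeholder; use whichever model is served
    messages=[{"role": "user", "content": "Classify the sentiment: vLLM is wonderful!"}],
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```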

docs/source/getting_started/installation/hpu-gaudi.md

Lines changed: 2 additions & 2 deletions
@@ -83,7 +83,7 @@ $ python setup.py develop
 ## Supported Features
 
 - [Offline inference](#offline-inference)
-- Online inference via [OpenAI-Compatible Server](#openai-compatible-server)
+- Online serving via [OpenAI-Compatible Server](#openai-compatible-server)
 - HPU autodetection - no need to manually select device within vLLM
 - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
 - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
@@ -385,5 +385,5 @@ the below:
 completely. With HPU Graphs disabled, you are trading latency and
 throughput at lower batches for potentially higher throughput on
 higher batches. You can do that by adding `--enforce-eager` flag to
-server (for online inference), or by passing `enforce_eager=True`
+server (for online serving), or by passing `enforce_eager=True`
 argument to LLM constructor (for offline inference).
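
The second hunk above contrasts the two ways of enforcing eager execution; a brief sketch of both follows (not part of this commit, and the model name is only a placeholder).

```python
# Illustrative sketch (not part of this commit): disable graph capture.
# Offline inference: pass enforce_eager=True to the LLM constructor.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=True)  # placeholder model

# Online serving: pass the flag to the server instead, e.g.
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --enforce-eager
```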

docs/source/getting_started/quickstart.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 This guide will help you quickly get started with vLLM to perform:
 
 - [Offline batched inference](#quickstart-offline)
-- [Online inference using OpenAI-compatible server](#quickstart-online)
+- [Online serving using OpenAI-compatible server](#quickstart-online)
 
 ## Prerequisites
 
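
As a reminder of the first item in that list, a minimal offline batched inference sketch (not part of this commit; the model name is only an example):

```python
# Illustrative sketch (not part of this commit): offline batched inference.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```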

docs/source/models/generative_models.md

Lines changed: 1 addition & 1 deletion
@@ -118,7 +118,7 @@ print("Loaded chat template:", custom_template)
 outputs = llm.chat(conversation, chat_template=custom_template)
 ```
 
-## Online Inference
+## Online Serving
 
 Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
 

docs/source/models/pooling_models.md

Lines changed: 1 addition & 1 deletion
@@ -127,7 +127,7 @@ print(f"Score: {score}")
 
 A code example can be found here: <gh-file:examples/offline_inference/offline_inference_scoring.py>
 
-## Online Inference
+## Online Serving
 
 Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
 

docs/source/models/supported_models.md

Lines changed: 2 additions & 2 deletions
@@ -552,7 +552,7 @@ See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the mod
 
 ````{important}
 To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
-or `--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt:
+or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
 
 Offline inference:
 ```python
@@ -562,7 +562,7 @@ llm = LLM(
 )
 ```
 
-Online inference:
+Online serving:
 ```bash
 vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
 ```
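
For comparison with the `--limit-mm-per-prompt` command above, the offline-inference form that the first hunk truncates would look roughly like this (not part of this commit; the keyword arguments beyond `limit_mm_per_prompt` are assumptions):

```python
# Illustrative sketch (not part of this commit): allow up to 4 images per
# text prompt when constructing the LLM for offline inference.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    limit_mm_per_prompt={"image": 4},
)
```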

docs/source/serving/multimodal_inputs.md

Lines changed: 1 addition & 1 deletion
@@ -199,7 +199,7 @@ for o in outputs:
     print(generated_text)
 ```
 
-## Online Inference
+## Online Serving
 
 Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
 
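
To illustrate the renamed Online Serving section, here is a sketch of passing an image through the Chat Completions API. It is not part of this commit; it assumes a multimodal model is already being served on localhost:8000, and the image URL is a placeholder.

```python
# Illustrative sketch (not part of this commit): multi-modal chat request
# against a vLLM OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder; use whichever model is served
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```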

examples/online_serving/openai_chat_completion_client_for_multimodal.py

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 """An example showing how to use vLLM to serve multimodal models
-and run online inference with OpenAI client.
+and run online serving with OpenAI client.
 
 Launch the vLLM server with the following command:
 
@@ -309,7 +309,7 @@ def main(args) -> None:
 
 if __name__ == "__main__":
     parser = FlexibleArgumentParser(
-        description='Demo on using OpenAI client for online inference with '
+        description='Demo on using OpenAI client for online serving with '
         'multimodal language models served with vLLM.')
     parser.add_argument('--chat-type',
                         '-c',

tests/models/decoder_only/audio_language/test_ultravox.py

Lines changed: 2 additions & 2 deletions
@@ -237,8 +237,8 @@ def test_models_with_multiple_audios(vllm_runner, audio_assets, dtype: str,
 
 
 @pytest.mark.asyncio
-async def test_online_inference(client, audio_assets):
-    """Exercises online inference with/without chunked prefill enabled."""
+async def test_online_serving(client, audio_assets):
+    """Exercises online serving with/without chunked prefill enabled."""
 
     messages = [{
         "role":

vllm/model_executor/models/molmo.py

Lines changed: 1 addition & 1 deletion
@@ -1068,7 +1068,7 @@ def input_processor_for_molmo(ctx: InputContext, inputs: DecoderOnlyInputs):
         trust_remote_code=model_config.trust_remote_code)
 
     # NOTE: message formatting for raw text prompt is only applied for
-    # offline inference; for online inference, the prompt is always in
+    # offline inference; for online serving, the prompt is always in
     # instruction format and tokenized.
     if prompt is not None and re.match(r"^User:[\s\S]*?(Assistant:)*$",
                                        prompt):
