Commit 3c86db5

Fix

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
1 parent d47a589

File tree

4 files changed (+16 −14 lines)

.pre-commit-config.yaml (+5 −3)

````diff
@@ -33,9 +33,11 @@ repos:
   rev: v0.9.27
   hooks:
   - id: pymarkdown
-    # NOTE: If you get an AssertionError when applying fixes,
-    # try setting args to [scan] and fix the lint errors manually
-    args: [fix]
+    # Conflicts with pyml disable, so we flag this to be fixed manually
+    args: [fix, -d, md007]
+  hooks:
+  - id: pymarkdown
+    args: [scan]
 - repo: https://github.com/rhysd/actionlint
   rev: v1.7.7
   hooks:
````

csrc/quantization/machete/Readme.md (+7 −7)
````diff
@@ -6,25 +6,25 @@ Machete is a spiritual successor to the Marlin kernel but optimized for Hopper a

 Machete effectively performs

-```
+```python
 scale_type = w_s.dtype
 compute_type = a.dtype
 out = (w_q.to(scale_type) * w_s - w_z.to(scale_type)) @ a
 ```

-Where `w_q` is a quantized weight matrix, `w_s` is the quantization scales, and
+Where `w_q` is a quantized weight matrix, `w_s` is the quantization scales, and
 `w_z` is the quantization zeropoints.

-> **_NOTE:_** `w_z` is added after the scales so we can
+> **_NOTE:_** `w_z` is added after the scales so we can
 use FMA operations, but this means they must have the scales pre-applied if the
-supplied zeropoints assume that they will be subtracted before the scales are
+supplied zeropoints assume that they will be subtracted before the scales are
 applied.
````
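The pre-applied-scales requirement in that note can be checked numerically. Below is a minimal NumPy sketch (the names are illustrative, not vLLM APIs): the conventional dequantization `(w_q - z) * w_s` matches Machete's FMA-friendly `w_q * w_s - w_z` formulation exactly when the supplied zeropoints are `w_z = z * w_s`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy int4-style quantized weights (stored as int8), per-column scales/zeropoints.
w_q = rng.integers(0, 16, size=(8, 4)).astype(np.int8)
z = rng.integers(0, 16, size=(1, 4)).astype(np.int8)   # "raw" zeropoints
w_s = rng.random((1, 4), dtype=np.float32) * 0.1       # scales
a = rng.random((4, 3), dtype=np.float32)

# Conventional dequant: subtract zeropoints, then apply scales.
ref = ((w_q.astype(np.float32) - z.astype(np.float32)) * w_s) @ a

# Machete's formulation (per the snippet above): w_q * w_s - w_z.
# This only matches if the supplied w_z has the scales pre-applied,
# i.e. w_z = z * w_s.
w_z = z.astype(np.float32) * w_s
out = (w_q.astype(np.float32) * w_s - w_z) @ a

assert np.allclose(ref, out)
```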
````diff
 ## API

 The main optimization within Machete is prepacking the weight matrix to more closely match the tensor core layouts, allowing for wider shared memory loads when loading the weight matrix. This means that the weight matrix must be prepacked before calling `machete_gemm`. The flow looks something like:

-```
+```python
 from vllm import _custom_ops as ops

 ...
````
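Prepacking itself is Machete-specific, but the general idea, a one-time layout transform so that later GEMMs read the weights in tile-contiguous order, can be sketched in plain NumPy (a toy layout for illustration, not the actual Machete or tensor-core layout):

```python
import numpy as np

def prepack(w, tile=4):
    # Toy "prepack": group columns into tiles and store each tile
    # contiguously, mimicking a one-time layout transform performed
    # before repeated GEMM calls.
    k, n = w.shape
    return np.ascontiguousarray(w.reshape(k, n // tile, tile).transpose(1, 0, 2))

def gemm_prepacked(a, w_packed):
    # Undo the toy layout and multiply; a real kernel would instead
    # consume the packed layout directly for wider contiguous loads.
    n_tiles, k, tile = w_packed.shape
    w = w_packed.transpose(1, 0, 2).reshape(k, n_tiles * tile)
    return a @ w

rng = np.random.default_rng(0)
a = rng.random((2, 8))
w = rng.random((8, 16))
w_packed = prepack(w)              # done once, ahead of time
out = gemm_prepacked(a, w_packed)
assert np.allclose(out, a @ w)
```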
````diff
@@ -40,6 +40,6 @@ output = ops.machete_gemm(

 ## Code Generation

-Since Machete is based on Cutlass, we can generate multiple type pairs and different tile shapes using the same kernel template. We generate multiple instantiations of this template using `generate.py`.
+Since Machete is based on Cutlass, we can generate multiple type pairs and different tile shapes using the same kernel template. We generate multiple instantiations of this template using `generate.py`.

-New type pairs (`TypeConfig`s) can be appended to `impl_configs` (in `generate()`), and these will get automatically generated (assuming they can be supported without issues). For each `TypeConfig`, you must also provide an `ImplConfig`, which bundles a `TypeConfig` with a list of `ScheduleConfig`s, `Specialization`s, and a default heuristic. The `ScheduleConfig`s (which contain info on tile shapes, tile scheduler, etc.) can perform differently for different problem shapes, and there is almost never one `ScheduleConfig` that works well for all problem shapes, so it is generally beneficial to generate different `ScheduleConfig`s for different potential problem shapes. This is where the heuristic comes in. For each `TypeConfig`, a default heuristic should be provided. This maps different problem shapes to different `ScheduleConfig`s and is used when the user does not provide the `schedule` parameter to `machete_gemm`. The `Specialization`s define what feature combinations to generate, i.e., `with_zeropoints`, `with_scales`, etc. We can reduce compile times and the final binary size by limiting the set of feature combinations we generate.
+New type pairs (`TypeConfig`s) can be appended to `impl_configs` (in `generate()`), and these will get automatically generated (assuming they can be supported without issues). For each `TypeConfig`, you must also provide an `ImplConfig`, which bundles a `TypeConfig` with a list of `ScheduleConfig`s, `Specialization`s, and a default heuristic. The `ScheduleConfig`s (which contain info on tile shapes, tile scheduler, etc.) can perform differently for different problem shapes, and there is almost never one `ScheduleConfig` that works well for all problem shapes, so it is generally beneficial to generate different `ScheduleConfig`s for different potential problem shapes. This is where the heuristic comes in. For each `TypeConfig`, a default heuristic should be provided. This maps different problem shapes to different `ScheduleConfig`s and is used when the user does not provide the `schedule` parameter to `machete_gemm`. The `Specialization`s define what feature combinations to generate, i.e., `with_zeropoints`, `with_scales`, etc. We can reduce compile times and the final binary size by limiting the set of feature combinations we generate.
````
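The heuristic described in that paragraph, a mapping from problem shape to `ScheduleConfig` used when no `schedule` is passed to `machete_gemm`, can be illustrated with a hypothetical sketch. The class fields and tile sizes below are invented for illustration and do not match the actual definitions in `generate.py`:

```python
from dataclasses import dataclass

# Illustrative sketch only: these mirror the names used in the text,
# not the actual classes in generate.py.

@dataclass(frozen=True)
class ScheduleConfig:
    tile_m: int
    tile_n: int

def default_heuristic(m: int) -> ScheduleConfig:
    # Map the problem shape (here just M, the batch/token dimension)
    # to a schedule, as a default heuristic would when the caller
    # does not supply an explicit schedule.
    if m <= 16:
        return ScheduleConfig(tile_m=16, tile_n=256)   # skinny shapes
    return ScheduleConfig(tile_m=128, tile_n=128)      # large shapes

assert default_heuristic(1) == ScheduleConfig(tile_m=16, tile_n=256)
assert default_heuristic(4096) == ScheduleConfig(tile_m=128, tile_n=128)
```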

docs/source/serving/engine_args.md (+2 −2)
````diff
@@ -4,7 +4,7 @@

 Below, you can find an explanation of every engine argument for vLLM:

-<!--- pyml disable-num-lines 7 no-space-in-emphasis-->
+<!--- pyml disable-num-lines 7 no-space-in-emphasis -->
 ```{eval-rst}
 .. argparse::
     :module: vllm.engine.arg_utils
@@ -17,7 +17,7 @@ Below, you can find an explanation of every engine argument for vLLM:

 Below are the additional arguments related to the asynchronous engine:

-<!--- pyml disable-num-lines 7 no-space-in-emphasis-->
+<!--- pyml disable-num-lines 7 no-space-in-emphasis -->
 ```{eval-rst}
 .. argparse::
     :module: vllm.engine.arg_utils
````

examples/offline_inference/openai/openai_batch.md (+2 −2)
@@ -182,7 +182,7 @@ aws s3 cp s3://MY_BUCKET/MY_OUTPUT_FILE.jsonl -
182182

183183
Add embedding requests to your batch file. The following is an example:
184184

185-
```jsonl
185+
```text
186186
{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are a helpful assistant."}}
187187
{"custom_id": "request-2", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are an unhelpful assistant."}}
188188
```
@@ -213,7 +213,7 @@ $ cat results.jsonl
213213

214214
Add score requests to your batch file. The following is an example:
215215

216-
```jsonl
216+
```text
217217
{"custom_id": "request-1", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
218218
{"custom_id": "request-2", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
219219
```
