csrc/quantization/machete/Readme.md (+7 -7)

@@ -6,25 +6,25 @@ Machete is a spiritual successor to the Marlin kernel but optimized for Hopper a

Machete effectively performs

```python
scale_type = w_s.dtype
compute_type = a.dtype
out = (w_q.to(scale_type) * w_s - w_z.to(scale_type)) @ a
```

Where `w_q` is a quantized weight matrix, `w_s` is the quantization scales, and
`w_z` is the quantization zeropoints.

> **_NOTE:_** `w_z` is added after the scales so we can
use FMA operations, but this means they must have the scales pre-applied if the
supplied zeropoints assume that they will be subtracted before the scales are
applied.

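To make the note concrete, here is a small, self-contained PyTorch sketch using toy tensors (not the Machete kernel itself): zeropoints that are meant to be subtracted before scaling must be multiplied by the scales before being supplied in the FMA-style form above.

```python
import torch

# Toy stand-ins, not real Machete inputs: "quantized" weights, per-column
# scales/zeropoints, and an activation matrix (float32 keeps the comparison
# exact enough for allclose).
w_q = torch.randint(0, 16, (8, 4)).float()
w_z = torch.randint(0, 16, (1, 4)).float()
w_s = torch.rand(1, 4)
a = torch.rand(4, 3)

# Conventional dequantization: subtract the zeropoint, then apply the scale.
ref = ((w_q - w_z) * w_s) @ a

# FMA-friendly form used above: the zeropoint is subtracted *after* scaling,
# so it must already carry the scales (w_z_scaled = w_z * w_s).
w_z_scaled = w_z * w_s
out = (w_q * w_s - w_z_scaled) @ a

assert torch.allclose(ref, out, atol=1e-5)
```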
## API
The main optimization within Machete is prepacking the weight matrix to more closely match the tensor core layouts, allowing for wider shared memory loads when loading the weight matrix. This means that the weight matrix must be prepacked before calling `machete_gemm`. The flow looks something like:
```python
from vllm import _custom_ops as ops

...
```

@@ -40,6 +40,6 @@ output = ops.machete_gemm(
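The example above is cut off by the diff. As a rough, hypothetical sketch only (the helper and keyword-argument names such as `machete_prepack_B`, `b_scales`, and `b_group_size` are assumptions and should be checked against `_custom_ops.py`), the prepack-then-GEMM flow described in the text might look like:

```python
# Hypothetical sketch of the flow described above; argument names are
# assumptions and may not match the current _custom_ops API exactly.
from vllm import _custom_ops as ops

# w_q, w_s, a, wtype (the quantized weight type), and group_size are assumed
# to be defined as in the surrounding README.
w_q_packed = ops.machete_prepack_B(w_q, wtype)  # prepack once, reuse per GEMM
output = ops.machete_gemm(
    a,
    b_q=w_q_packed,
    b_type=wtype,
    b_scales=w_s,
    b_group_size=group_size,
)
```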
## Code Generation
Since Machete is based on Cutlass, we can generate multiple type pairs and different tile shapes using the same kernel template. We generate multiple instantiations of this template using `generate.py`.

New type pairs (`TypeConfig`s) can be appended to `impl_configs` (in `generate()`), and these will get automatically generated (assuming they can be supported without issues). For each `TypeConfig`, you must also provide an `ImplConfig`, which bundles a `TypeConfig` with a list of `ScheduleConfig`s, `Specialization`s, and a default heuristic. The `ScheduleConfig`s (which contain info on tile shapes, tile scheduler, etc.) can perform differently for different problem shapes, and there is almost never one `ScheduleConfig` that works well for all problem shapes, so it is generally beneficial to generate different `ScheduleConfig`s for different potential problem shapes. This is where the heuristic comes in. For each `TypeConfig`, a default heuristic should be provided. This maps different problem shapes to different `ScheduleConfig`s and is used when the user does not provide the `schedule` parameter to `machete_gemm`. The `Specialization`s define what feature combinations to generate, e.g., `with_zeropoints`, `with_scales`, etc. We can reduce compile times and the final binary size by limiting the set of feature combinations we generate.
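As a purely illustrative sketch (the field names below are assumptions, not the actual definitions in `generate.py`), the relationship between these pieces can be pictured roughly like this:

```python
# Hypothetical shapes only -- the real dataclasses in generate.py differ in
# detail. This just illustrates how the pieces described above fit together.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class TypeConfig:          # the type pair to instantiate (A, quantized B, ...)
    element_a: str
    element_b: str

@dataclass
class ScheduleConfig:      # tile shape, tile scheduler, etc.
    tile_shape_mn: Tuple[int, int]
    tile_scheduler: str

@dataclass
class Specialization:      # which optional features to compile in
    with_scales: bool
    with_zeropoints: bool

@dataclass
class ImplConfig:
    types: TypeConfig
    schedules: List[ScheduleConfig]
    specializations: List[Specialization]
    # Default heuristic: maps a problem shape (M, N, K) to a ScheduleConfig;
    # used when no `schedule` is passed to machete_gemm.
    heuristic: Callable[[Tuple[int, int, int]], ScheduleConfig]

# Inside generate(), a new type pair would then be appended roughly like:
# impl_configs.append(ImplConfig(types=..., schedules=..., specializations=..., heuristic=...))
```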
Add embedding requests to your batch file. The following is an example:

```text
{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are a helpful assistant."}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are an unhelpful assistant."}}
```
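If you prefer to build the batch file programmatically, a minimal Python sketch that writes exactly the two requests shown above (the output file name is arbitrary) would be:

```python
# Minimal sketch: write the two embedding requests shown above to a JSONL
# batch file. Only the request format from the example is used.
import json

requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/v1/embeddings",
        "body": {
            "model": "intfloat/e5-mistral-7b-instruct",
            "input": "You are a helpful assistant.",
        },
    },
    {
        "custom_id": "request-2",
        "method": "POST",
        "url": "/v1/embeddings",
        "body": {
            "model": "intfloat/e5-mistral-7b-instruct",
            "input": "You are an unhelpful assistant.",
        },
    },
]

with open("embedding_batch.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
```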
@@ -213,7 +213,7 @@ $ cat results.jsonl

Add score requests to your batch file. The following is an example:

```text
{"custom_id": "request-1", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
0 commit comments