Marlin downstream clean #26
…anch safe_expose_semi_structured_sparse_tensor
…re (eager_force=False)
…size by running multiple parallel problems of size 64. (2) Refactor the workspace to be dynamic per layer
…d issues with tensor parallel runs)
cleanup to undo autoformatting
To use:

from vllm import LLM, SamplingParams

model = LLM("robertgshaw2/TinyLlama-1.1B-Chat-v1.0-g128-marlin")
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
outputs[0].outputs[0].text
result in very slight nondeterminism for Marlin. As a result, we re-run the test
up to 3 times to see if we pass.

Run `pytest tests/models/test_marlin.py --forked`.
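The retry behaviour described in the excerpt above boils down to re-running the comparison up to three times and only failing if every attempt fails. A minimal sketch of that pattern, assuming a hypothetical `check_outputs_match` callable (not the actual test code):

```python
# Hypothetical sketch of "re-run the test up to 3 times" for a slightly
# nondeterministic kernel; check_outputs_match is an assumed callable that
# raises AssertionError when the Marlin and reference outputs diverge.
MAX_ATTEMPTS = 3


def run_with_retries(check_outputs_match, max_attempts: int = MAX_ATTEMPTS) -> None:
    last_error = None
    for _ in range(max_attempts):
        try:
            check_outputs_match()  # raises AssertionError on mismatch
            return                 # passed on this attempt
        except AssertionError as err:
            last_error = err       # remember the failure and try again
    raise last_error               # every attempt failed
```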
What is the "forked" argument doing here?
I'm not sure; all the other model tests have this arg, though.
I've just been running `pytest tests/models/test_marlin.py`.
I think we'll need to include a license file for marlin somewhere
torch::Tensor& a,
torch::Tensor& b_q_weight,
torch::Tensor& b_scales,
can these be `const&`?
@alexm-nm ?
We have the following in the CUDA code:
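(The quoted snippet did not survive extraction.) As an illustration only, a `const`-qualified version of the arguments from the diff above might look like the sketch below; the function name, the workspace argument, and the size parameters are assumptions rather than the actual vLLM source. The point is simply that the read-only inputs could be taken by `const&` while any tensor the kernel writes to stays non-const:

```cpp
#include <torch/extension.h>

// Hypothetical sketch only: the three tensors questioned above taken by
// const reference; everything except a/b_q_weight/b_scales is assumed.
torch::Tensor marlin_gemm(const torch::Tensor& a,          // fp16 activations (read-only)
                          const torch::Tensor& b_q_weight, // packed 4-bit weights (read-only)
                          const torch::Tensor& b_scales,   // per-group scales (read-only)
                          torch::Tensor& workspace,        // scratch buffer the kernel writes to
                          int64_t size_m,
                          int64_t size_n,
                          int64_t size_k);
```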
Closing in favor of #43
Cleaned-up version of the Marlin PR that can be merged into main.

Added some E2E tests, which compare the results of the exllama kernels to the Marlin kernels. See `tests/models/test_marlin.py` for more details.
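For reference, a minimal sketch of what such an exllama-vs-Marlin comparison could look like; the GPTQ counterpart model ID, the helper name, and the exact-match assertion are assumptions, and the real test lives in `tests/models/test_marlin.py`:

```python
# Hypothetical sketch of an end-to-end comparison between a GPTQ (exllama)
# checkpoint and its Marlin-formatted counterpart; model IDs are assumptions.
from vllm import LLM, SamplingParams

MARLIN_MODEL = "robertgshaw2/TinyLlama-1.1B-Chat-v1.0-g128-marlin"
GPTQ_MODEL = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ"  # assumed counterpart

PROMPTS = ["Hello my name is", "The capital of France is"]


def greedy_generate(model_name: str, prompts: list[str], max_tokens: int = 32) -> list[str]:
    """Generate greedily so outputs are comparable across kernels."""
    llm = LLM(model=model_name)
    params = SamplingParams(temperature=0, max_tokens=max_tokens)
    outputs = llm.generate(prompts, sampling_params=params)
    return [out.outputs[0].text for out in outputs]


def test_marlin_matches_gptq():
    # Loading both models back to back assumes enough GPU memory is free; the
    # slight kernel nondeterminism mentioned earlier is why the real test retries.
    marlin_texts = greedy_generate(MARLIN_MODEL, PROMPTS)
    gptq_texts = greedy_generate(GPTQ_MODEL, PROMPTS)
    for marlin_text, gptq_text in zip(marlin_texts, gptq_texts):
        assert marlin_text == gptq_text
```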