Marlin downstream clean #26
…anch safe_expose_semi_structured_sparse_tensor
…re (eager_force=False)
…size by running multiple parallel problems of size 64. (2) Refactor the workspace to be dynamic per layer
…d issues with tensor parallel runs)
cleanup to undo autoformatting
To use:

from vllm import LLM, SamplingParams

model = LLM("robertgshaw2/TinyLlama-1.1B-Chat-v1.0-g128-marlin")
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
outputs[0].outputs[0].text
result in very slight nondeterminism for Marlin. As a result, we re-run the test
up to 3 times to see if we pass.

Run `pytest tests/models/test_marlin.py --forked`.
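The retry behaviour described in the excerpt above boils down to re-running the comparison up to three times and only failing if every attempt fails. A minimal sketch of that pattern, assuming a hypothetical `check_outputs_match` callable (not the actual test code):

```python
# Hypothetical sketch of "re-run the test up to 3 times" for a slightly
# nondeterministic kernel; check_outputs_match is an assumed callable that
# raises AssertionError when the Marlin and reference outputs diverge.
MAX_ATTEMPTS = 3


def run_with_retries(check_outputs_match, max_attempts: int = MAX_ATTEMPTS) -> None:
    last_error = None
    for _ in range(max_attempts):
        try:
            check_outputs_match()  # raises AssertionError on mismatch
            return                 # passed on this attempt
        except AssertionError as err:
            last_error = err       # remember the failure and try again
    raise last_error               # every attempt failed
```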
What is the "forked" argument doing here?
I'm not sure; all the other model tests have this arg, though.
I've just been running `pytest tests/models/test_marlin.py`.
I think we'll need to include a license file for marlin somewhere
torch::Tensor& a,
torch::Tensor& b_q_weight,
torch::Tensor& b_scales,
can these be `const&`?
@alexm-nm ?
We have the following in the CUDA code:
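(The quoted snippet did not survive extraction.) As an illustration only, a `const`-qualified version of the arguments from the diff above might look like the sketch below; the function name, the workspace argument, and the size parameters are assumptions rather than the actual vLLM source. The point is simply that the read-only inputs could be taken by `const&` while any tensor the kernel writes to stays non-const:

```cpp
#include <torch/extension.h>

// Hypothetical sketch only: the three tensors questioned above taken by
// const reference; everything except a/b_q_weight/b_scales is assumed.
torch::Tensor marlin_gemm(const torch::Tensor& a,          // fp16 activations (read-only)
                          const torch::Tensor& b_q_weight, // packed 4-bit weights (read-only)
                          const torch::Tensor& b_scales,   // per-group scales (read-only)
                          torch::Tensor& workspace,        // scratch buffer the kernel writes to
                          int64_t size_m,
                          int64_t size_n,
                          int64_t size_k);
```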
Closing in favor of #43
Cleaned-up version of the Marlin PR that can be merged into main.

Added some E2E tests, which compare the results of the exllama kernels to the Marlin kernels. See `tests/models/test_marlin.py` for more details.
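For reference, a minimal sketch of what such an exllama-vs-Marlin comparison could look like; the GPTQ counterpart model ID, the helper name, and the exact-match assertion are assumptions, and the real test lives in `tests/models/test_marlin.py`:

```python
# Hypothetical sketch of an end-to-end comparison between a GPTQ (exllama)
# checkpoint and its Marlin-formatted counterpart; model IDs are assumptions.
from vllm import LLM, SamplingParams

MARLIN_MODEL = "robertgshaw2/TinyLlama-1.1B-Chat-v1.0-g128-marlin"
GPTQ_MODEL = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ"  # assumed counterpart

PROMPTS = ["Hello my name is", "The capital of France is"]


def greedy_generate(model_name: str, prompts: list[str], max_tokens: int = 32) -> list[str]:
    """Generate greedily so outputs are comparable across kernels."""
    llm = LLM(model=model_name)
    params = SamplingParams(temperature=0, max_tokens=max_tokens)
    outputs = llm.generate(prompts, sampling_params=params)
    return [out.outputs[0].text for out in outputs]


def test_marlin_matches_gptq():
    # Loading both models back to back assumes enough GPU memory is free; the
    # slight kernel nondeterminism mentioned earlier is why the real test retries.
    marlin_texts = greedy_generate(MARLIN_MODEL, PROMPTS)
    gptq_texts = greedy_generate(GPTQ_MODEL, PROMPTS)
    for marlin_text, gptq_text in zip(marlin_texts, gptq_texts):
        assert marlin_text == gptq_text
```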