Skip to content
This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Sparse fused gemm integration #12

Merged
merged 23 commits into from
Feb 14, 2024

Conversation

LucasWilkinson
Copy link
Collaborator

@LucasWilkinson LucasWilkinson commented Feb 14, 2024

Summary:

Initial integration for the sparse-fused gemm. To achieve this, we need to ensure that we compress the weight matrix only once and never decompress it, as decompression is currently unsupported.

Before this change, using SparseParameter(SparseTensor) meant that in MergedColumnParallelLinear and QKVParallelLinear every time a new shard was loaded by the weight_loader (e.g., the "q" portion of QKVParallelLinear), we would decompress the tensor in-order to use narrow to update the appropriate section of the weight tensor. With this change, SparseParameter(SparseTensor) is replaced with LazyCompressedParameter, which allows us to operate on uncompressed_data until we explicitly compress it. At that point, the uncompressed_data is compressed into compressed_data and freed. Currently, the detection of when to call compress is somewhat hacky. For QKVParallelLinear, we compress only after inserting "q", "k", and "v" shard ids, and for MergedColumnParallelLinear, we compress once we've inserted the same number of shards as outputs (determined by len(output_sizes)), which implicitly assumes one shard per output.

Moving away from SparseParameter(SparseTensor) means that SparseTensor no longer handles dispatching to the custom ops; instead, this is handled by SparseW16A16LinearMethod. I believe this is a positive change overall. SparseTensor was an unnecessary extra layer of abstraction/indirection originally designed for the SLoRA work, not vLLM.

This did result in the 2:4 sparse implementation breaking. However, it turns out it was already broken (i.e., it was decompressing and running dense within SparseTensor), so we "disable" it for now ("disable" meaning decompress and run dense instead).

We should revisit all of this infrastructure post-MVP.

@afeldman-nm
Copy link

afeldman-nm commented Feb 14, 2024

Thanks for the heads-up Re: 2-4 @LucasWilkinson . The torch.Tensor dispatch mechanism is I agree unwieldy for our purposes (a bit too powerful for what we want to do, given that in vLLM we are almost exclusively dispatching to the linear operator.) I agree moving past it is a good choice. For example, the failure of the compressed 2:4 kernel to be invoked by vLLM is an issue with how I integrated 2:4 into SparseTensor's dispatch.

In a separate PR I will address the issue with the 2:4 integration. For now deactivating 2:4 is a good choice.

@LucasWilkinson LucasWilkinson merged commit 4f225b4 into main Feb 14, 2024
2 checks passed
@LucasWilkinson LucasWilkinson deleted the lwilkinson/sparse-fused-gemm-integration branch February 14, 2024 22:53
robertgshaw2-neuralmagic pushed a commit that referenced this pull request Feb 20, 2024
Summary:

Initial integration for the sparse-fused gemm. To achieve this, we need
to ensure that we compress the weight matrix only once and never
decompress it, as decompression is currently unsupported.

Before this change, using `SparseParameter(SparseTensor)` meant that in
`MergedColumnParallelLinear` and `QKVParallelLinear` every time a new
shard was loaded by the `weight_loader` (e.g., the "q" portion of
`QKVParallelLinear`), we would decompress the tensor in-order to use
narrow to update the appropriate section of the weight tensor. With this
change, `SparseParameter(SparseTensor)` is replaced with
`LazyCompressedParameter`, which allows us to operate on
`uncompressed_data` until we explicitly compress it. At that point, the
`uncompressed_data` is compressed into `compressed_data` and freed.
Currently, the detection of when to call compress is somewhat hacky. For
`QKVParallelLinear`, we compress only after inserting "q", "k", and "v"
shard ids, and for `MergedColumnParallelLinear`, we compress once we've
inserted the same number of shards as outputs (determined by
`len(output_sizes)`), which implicitly assumes one shard per output.

Moving away from `SparseParameter(SparseTensor)` means that
`SparseTensor` no longer handles dispatching to the custom ops; instead,
this is handled by `SparseW16A16LinearMethod`. I believe this is a
positive change overall. `SparseTensor` was an unnecessary extra layer
of abstraction/indirection originally designed for the SLoRA work, not
vLLM.

This did result in the 2:4 sparse implementation breaking. However, it
turns out it was already broken (i.e., it was decompressing and running
dense within `SparseTensor`), so we "disable" it for now ("disable"
meaning decompress and run dense instead).

We should revisit all of this infrastructure post-MVP.

---------

Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
robertgshaw2-neuralmagic pushed a commit that referenced this pull request Feb 20, 2024
Summary:

Initial integration for the sparse-fused gemm. To achieve this, we need
to ensure that we compress the weight matrix only once and never
decompress it, as decompression is currently unsupported.

Before this change, using `SparseParameter(SparseTensor)` meant that in
`MergedColumnParallelLinear` and `QKVParallelLinear` every time a new
shard was loaded by the `weight_loader` (e.g., the "q" portion of
`QKVParallelLinear`), we would decompress the tensor in-order to use
narrow to update the appropriate section of the weight tensor. With this
change, `SparseParameter(SparseTensor)` is replaced with
`LazyCompressedParameter`, which allows us to operate on
`uncompressed_data` until we explicitly compress it. At that point, the
`uncompressed_data` is compressed into `compressed_data` and freed.
Currently, the detection of when to call compress is somewhat hacky. For
`QKVParallelLinear`, we compress only after inserting "q", "k", and "v"
shard ids, and for `MergedColumnParallelLinear`, we compress once we've
inserted the same number of shards as outputs (determined by
`len(output_sizes)`), which implicitly assumes one shard per output.

Moving away from `SparseParameter(SparseTensor)` means that
`SparseTensor` no longer handles dispatching to the custom ops; instead,
this is handled by `SparseW16A16LinearMethod`. I believe this is a
positive change overall. `SparseTensor` was an unnecessary extra layer
of abstraction/indirection originally designed for the SLoRA work, not
vLLM.

This did result in the 2:4 sparse implementation breaking. However, it
turns out it was already broken (i.e., it was decompressing and running
dense within `SparseTensor`), so we "disable" it for now ("disable"
meaning decompress and run dense instead).

We should revisit all of this infrastructure post-MVP.

---------

Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
robertgshaw2-neuralmagic pushed a commit that referenced this pull request Feb 21, 2024
Summary:

Initial integration for the sparse-fused gemm. To achieve this, we need
to ensure that we compress the weight matrix only once and never
decompress it, as decompression is currently unsupported.

Before this change, using `SparseParameter(SparseTensor)` meant that in
`MergedColumnParallelLinear` and `QKVParallelLinear` every time a new
shard was loaded by the `weight_loader` (e.g., the "q" portion of
`QKVParallelLinear`), we would decompress the tensor in-order to use
narrow to update the appropriate section of the weight tensor. With this
change, `SparseParameter(SparseTensor)` is replaced with
`LazyCompressedParameter`, which allows us to operate on
`uncompressed_data` until we explicitly compress it. At that point, the
`uncompressed_data` is compressed into `compressed_data` and freed.
Currently, the detection of when to call compress is somewhat hacky. For
`QKVParallelLinear`, we compress only after inserting "q", "k", and "v"
shard ids, and for `MergedColumnParallelLinear`, we compress once we've
inserted the same number of shards as outputs (determined by
`len(output_sizes)`), which implicitly assumes one shard per output.

Moving away from `SparseParameter(SparseTensor)` means that
`SparseTensor` no longer handles dispatching to the custom ops; instead,
this is handled by `SparseW16A16LinearMethod`. I believe this is a
positive change overall. `SparseTensor` was an unnecessary extra layer
of abstraction/indirection originally designed for the SLoRA work, not
vLLM.

This did result in the 2:4 sparse implementation breaking. However, it
turns out it was already broken (i.e., it was decompressing and running
dense within `SparseTensor`), so we "disable" it for now ("disable"
meaning decompress and run dense instead).

We should revisit all of this infrastructure post-MVP.

---------

Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
tlrmchlsmth pushed a commit that referenced this pull request Feb 21, 2024
Summary:

Initial integration for the sparse-fused gemm. To achieve this, we need
to ensure that we compress the weight matrix only once and never
decompress it, as decompression is currently unsupported.

Before this change, using `SparseParameter(SparseTensor)` meant that in
`MergedColumnParallelLinear` and `QKVParallelLinear` every time a new
shard was loaded by the `weight_loader` (e.g., the "q" portion of
`QKVParallelLinear`), we would decompress the tensor in-order to use
narrow to update the appropriate section of the weight tensor. With this
change, `SparseParameter(SparseTensor)` is replaced with
`LazyCompressedParameter`, which allows us to operate on
`uncompressed_data` until we explicitly compress it. At that point, the
`uncompressed_data` is compressed into `compressed_data` and freed.
Currently, the detection of when to call compress is somewhat hacky. For
`QKVParallelLinear`, we compress only after inserting "q", "k", and "v"
shard ids, and for `MergedColumnParallelLinear`, we compress once we've
inserted the same number of shards as outputs (determined by
`len(output_sizes)`), which implicitly assumes one shard per output.

Moving away from `SparseParameter(SparseTensor)` means that
`SparseTensor` no longer handles dispatching to the custom ops; instead,
this is handled by `SparseW16A16LinearMethod`. I believe this is a
positive change overall. `SparseTensor` was an unnecessary extra layer
of abstraction/indirection originally designed for the SLoRA work, not
vLLM.

This did result in the 2:4 sparse implementation breaking. However, it
turns out it was already broken (i.e., it was decompressing and running
dense within `SparseTensor`), so we "disable" it for now ("disable"
meaning decompress and run dense instead).

We should revisit all of this infrastructure post-MVP.

---------

Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
robertgshaw2-neuralmagic pushed a commit that referenced this pull request Feb 21, 2024
Summary:

Initial integration for the sparse-fused gemm. To achieve this, we need
to ensure that we compress the weight matrix only once and never
decompress it, as decompression is currently unsupported.

Before this change, using `SparseParameter(SparseTensor)` meant that in
`MergedColumnParallelLinear` and `QKVParallelLinear` every time a new
shard was loaded by the `weight_loader` (e.g., the "q" portion of
`QKVParallelLinear`), we would decompress the tensor in-order to use
narrow to update the appropriate section of the weight tensor. With this
change, `SparseParameter(SparseTensor)` is replaced with
`LazyCompressedParameter`, which allows us to operate on
`uncompressed_data` until we explicitly compress it. At that point, the
`uncompressed_data` is compressed into `compressed_data` and freed.
Currently, the detection of when to call compress is somewhat hacky. For
`QKVParallelLinear`, we compress only after inserting "q", "k", and "v"
shard ids, and for `MergedColumnParallelLinear`, we compress once we've
inserted the same number of shards as outputs (determined by
`len(output_sizes)`), which implicitly assumes one shard per output.

Moving away from `SparseParameter(SparseTensor)` means that
`SparseTensor` no longer handles dispatching to the custom ops; instead,
this is handled by `SparseW16A16LinearMethod`. I believe this is a
positive change overall. `SparseTensor` was an unnecessary extra layer
of abstraction/indirection originally designed for the SLoRA work, not
vLLM.

This did result in the 2:4 sparse implementation breaking. However, it
turns out it was already broken (i.e., it was decompressing and running
dense within `SparseTensor`), so we "disable" it for now ("disable"
meaning decompress and run dense instead).

We should revisit all of this infrastructure post-MVP.

---------

Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
robertgshaw2-neuralmagic pushed a commit that referenced this pull request Feb 22, 2024
Summary:

Initial integration for the sparse-fused gemm. To achieve this, we need
to ensure that we compress the weight matrix only once and never
decompress it, as decompression is currently unsupported.

Before this change, using `SparseParameter(SparseTensor)` meant that in
`MergedColumnParallelLinear` and `QKVParallelLinear` every time a new
shard was loaded by the `weight_loader` (e.g., the "q" portion of
`QKVParallelLinear`), we would decompress the tensor in-order to use
narrow to update the appropriate section of the weight tensor. With this
change, `SparseParameter(SparseTensor)` is replaced with
`LazyCompressedParameter`, which allows us to operate on
`uncompressed_data` until we explicitly compress it. At that point, the
`uncompressed_data` is compressed into `compressed_data` and freed.
Currently, the detection of when to call compress is somewhat hacky. For
`QKVParallelLinear`, we compress only after inserting "q", "k", and "v"
shard ids, and for `MergedColumnParallelLinear`, we compress once we've
inserted the same number of shards as outputs (determined by
`len(output_sizes)`), which implicitly assumes one shard per output.

Moving away from `SparseParameter(SparseTensor)` means that
`SparseTensor` no longer handles dispatching to the custom ops; instead,
this is handled by `SparseW16A16LinearMethod`. I believe this is a
positive change overall. `SparseTensor` was an unnecessary extra layer
of abstraction/indirection originally designed for the SLoRA work, not
vLLM.

This did result in the 2:4 sparse implementation breaking. However, it
turns out it was already broken (i.e., it was decompressing and running
dense within `SparseTensor`), so we "disable" it for now ("disable"
meaning decompress and run dense instead).

We should revisit all of this infrastructure post-MVP.

---------

Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
robertgshaw2-neuralmagic pushed a commit that referenced this pull request Feb 22, 2024
Summary:

Initial integration for the sparse-fused gemm. To achieve this, we need
to ensure that we compress the weight matrix only once and never
decompress it, as decompression is currently unsupported.

Before this change, using `SparseParameter(SparseTensor)` meant that in
`MergedColumnParallelLinear` and `QKVParallelLinear` every time a new
shard was loaded by the `weight_loader` (e.g., the "q" portion of
`QKVParallelLinear`), we would decompress the tensor in-order to use
narrow to update the appropriate section of the weight tensor. With this
change, `SparseParameter(SparseTensor)` is replaced with
`LazyCompressedParameter`, which allows us to operate on
`uncompressed_data` until we explicitly compress it. At that point, the
`uncompressed_data` is compressed into `compressed_data` and freed.
Currently, the detection of when to call compress is somewhat hacky. For
`QKVParallelLinear`, we compress only after inserting "q", "k", and "v"
shard ids, and for `MergedColumnParallelLinear`, we compress once we've
inserted the same number of shards as outputs (determined by
`len(output_sizes)`), which implicitly assumes one shard per output.

Moving away from `SparseParameter(SparseTensor)` means that
`SparseTensor` no longer handles dispatching to the custom ops; instead,
this is handled by `SparseW16A16LinearMethod`. I believe this is a
positive change overall. `SparseTensor` was an unnecessary extra layer
of abstraction/indirection originally designed for the SLoRA work, not
vLLM.

This did result in the 2:4 sparse implementation breaking. However, it
turns out it was already broken (i.e., it was decompressing and running
dense within `SparseTensor`), so we "disable" it for now ("disable"
meaning decompress and run dense instead).

We should revisit all of this infrastructure post-MVP.

---------

Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants