[ET-VK] Deprecate gpu_sizes_ubo() and toggle packing layout via specialization shader #3181
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/3181
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 6859982 with merge base 36453fc:
BROKEN TRUNK - The following job failed but was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D56377775
…ing layout via specialization constants

Pull Request resolved: #3181

## Context

This changeset cleans up how shaders consume tensor metadata in two ways:

### Pass in Packing Layout via Specialization Shader

The packing layout of a tensor determines how to convert between tensor indices and physical texture coordinates. Currently, the packing layout is determined by generating a completely new variant of a shader. However, this is rather expensive in terms of build size.

Specialization constant support was added a while back, which enables the packing layout to be communicated to the shader via a specialization constant. This is a much better and more natural way for shaders to determine the packing layout of their tensors and vary their behaviour.

The primary benefit of this is that we can vastly reduce the number of variants that are generated. Generating shader variants for every combination of dtype and memory layout can lead to a combinatorial explosion of build size.

Note that dtype cannot be passed as a specialization constant since it impacts the types used in the layout portion of a shader.

### Deprecate GPU sizes and Extents

Currently there are three representations of the tensor's sizes: `cpu_sizes()`, `gpu_sizes()`, and `extents()`. The GPU sizes are a simple modification of the CPU sizes where the packed dim is aligned to the next multiple of 4. Extents represent the physical extents of the image texture used to store the tensor.

However, shaders oftentimes need to reference the original sizes of the tensor, so we end up passing two different representations of the tensor sizes. The CPU sizes and extents are used to determine out-of-bounds elements, and the GPU sizes are used to convert between logical tensor indices and physical texture coordinates.

Since the GPU sizes and extents are easily determined from the CPU sizes given the packing layout, deprecate GPU sizes and use CPU sizes exclusively as the canonical tensor sizes. Hence `cpu_sizes()` is renamed to simply `sizes()`.

The primary benefits of this change are:

1. Less confusion over how to reference the tensor sizes
2. Fewer descriptors to bind when constructing compute pipelines
3. Fewer uniform buffers to update when resizing tensors between inferences

Differential Revision: [D56377775](https://our.internmc.facebook.com/intern/diff/D56377775/)

ghstack-source-id: 223317313
This pull request has been merged in c350e58.
## Context

#3181 deprecated the `gpu_sizes_ubo()` and `extents_ubo()` functions of `vTensor` in order to standardize how shaders consume shape/size metadata of input tensors. However, this came at the cost of increasing the overhead required for bounds checking, which is needed to support dynamic shapes, since shaders now need to convert the input sizes to texture limits before checking whether a given texel position is out of bounds. Benchmarking revealed that this overhead can be quite significant, especially on lower-power mobile GPUs.

In the interest of preserving performance, `extents_ubo()` is re-introduced, since bounds checking is an operation that is common to every single shader. However, some improvements are made:

* Instead of `extents`, the nomenclature `texture_limits` is used in order to differentiate from the physical image extents of the texture.
* `texture_limits` is represented via an `ivec3` (previously `uvec4`); this means that using it for bounds checking requires no implicit cast from `uvec` to `ivec` and no swizzling.

Also introduced in this changeset is the convention of passing in both the texture limits and the tensor sizes instead of using `pos_out_of_bounds()`. Passing in the texture limits is probably cheaper than using `pos_out_of_bounds()`. There are some exceptions, though, where I chose not to migrate to this pattern to avoid passing in too many variants of tensor metadata.

### What about `gpu_sizes_ubo`?

I will hold off on re-introducing `gpu_sizes_ubo` for now, since converting `sizes` to `gpu_sizes` is much cheaper compared to `pos_out_of_bounds()`:

```
// sizes is an ivec4 holding the tensor sizes
sizes[packed_dim] = alignup4(sizes[packed_dim]);
```

Will perform some additional benchmarking on this to see if the overhead of the alignment warrants an explicit API for passing in GPU sizes to shaders.

Differential Revision: [D56435574](https://our.internmc.facebook.com/intern/diff/D56435574/)
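As a concrete illustration of the bounds-checking pattern described above, a shader using the `ivec3` texture limits might look like the sketch below. This is a hedged example, not the actual ET-VK shader code; the block name `TexLimits`, the field `out_limits`, and the binding indices are assumptions made for illustration.

```
#version 450

// Hypothetical uniform block carrying the ivec3 texture limits.
layout(set = 0, binding = 1) uniform TexLimits {
  ivec3 out_limits;
};

layout(local_size_x = 4, local_size_y = 4, local_size_z = 4) in;

void main() {
  const ivec3 pos = ivec3(gl_GlobalInvocationID);
  // Direct ivec3 comparison: no uvec-to-ivec cast and no swizzling needed,
  // and no need to recompute the limits from the tensor sizes in the shader.
  if (any(greaterThanEqual(pos, out_limits))) {
    return;
  }
  // ... compute and write the output texel at `pos` ...
}
```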
Stack from ghstack (oldest at bottom):
[ET-VK] Deprecate gpu_sizes_ubo() and toggle packing layout via specialization shader #3181

Context
This changeset cleans up how shaders consume tensor metadata in two ways:
Pass in Packing Layout via Specialization Shader
The packing layout of a tensor determines how to convert between tensor indices and physical texture coordinates. Currently, the packing layout is determined by generating a completely new variant of a shader. However, this is rather expensive in terms of build size.
Specialization constant support was added a while back, which enables the packing layout to be communicated to the shader via a specialization constant. This is a much better and more natural way for shaders to determine the packing layout of their tensors and vary their behaviour.
The primary benefit of this is that we can vastly reduce the number of variants that are generated. Generating shader variants for every combination of dtype and memory layout can lead to a combinatorial explosion of build size.
Note that dtype cannot be passed as a specialization constant since it impacts the types used in the layout portion of a shader.
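To sketch the idea (an assumption-laden illustration, not the shader code added in this PR): a single shader body can read the packed dimension from a specialization constant and adjust its index math accordingly.

```
// Specialization constant selecting the packed dim; the encoding
// (0 = width, 1 = height, 2 = channels) is assumed for illustration.
layout(constant_id = 0) const int packed_dim = 0;

// The packed dim is the one whose elements are grouped four per texel,
// so it is the only component divided by 4 when locating a texel.
// Real layout conversions in the codebase may differ.
ivec3 idx_to_pos(ivec3 idx) {
  ivec3 pos = idx;
  pos[packed_dim] /= 4;
  return pos;
}
```

A single variant compiled once can then be specialized at pipeline-creation time for each layout, rather than generating one shader variant per layout.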
Deprecate GPU sizes
Currently there are two representations of the tensor's sizes: `cpu_sizes()` and `gpu_sizes()`. The GPU sizes are a simple modification of the CPU sizes where the packed dim is aligned to the next multiple of 4.

However, shaders oftentimes need to reference the original sizes of the tensor, so we end up passing both CPU sizes and GPU sizes. The CPU sizes are used to determine out-of-bounds elements, and the GPU sizes are used to convert between logical tensor indices and physical texture coordinates.

Since the GPU sizes are easily determined from the CPU sizes given the packing layout, deprecate GPU sizes and use CPU sizes exclusively as the canonical tensor sizes. Hence `cpu_sizes()` is renamed to simply `sizes()`.

The primary benefits of this change are:

1. Less confusion over how to reference the tensor sizes
2. Fewer descriptors to bind when constructing compute pipelines
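For example, the aligned (GPU) sizes can be recovered in the shader from the canonical sizes given the packed dim. The helper below is a hypothetical sketch (with an `alignup4`-style round-up, as referenced elsewhere in this stack), not code from this changeset.

```
// Hypothetical helper: round x up to the next multiple of 4.
int alignup4(int x) {
  return (x + 3) & ~3;
}

// Derive the former "GPU sizes" from the canonical sizes():
// only the packed dimension is padded to a multiple of 4.
ivec4 gpu_sizes_from(ivec4 sizes, int packed_dim) {
  ivec4 gpu_sizes = sizes;
  gpu_sizes[packed_dim] = alignup4(gpu_sizes[packed_dim]);
  return gpu_sizes;
}
```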
Differential Revision: D56377775