Add buffer cacheline size metric #4228
This pull request has been merged in 6903715.
This diff introduces a metric to GPUInfo that measures the cache line size of the buffer data pathway. In this experiment, all threads read two values from the cache with a varying stride between them. Reading two values from the same cache line is cheap, because the whole line is fetched as a block regardless of which bytes are actually wanted. As the separation between the two addresses grows, there is a point where the shader is forced to fetch two separate cache lines, which produces a latency increase that we can detect.
[This article](https://igoro.com/archive/gallery-of-processor-cache-effects/) has more information on the topic.
Each run of the shader fetches the two values from different points in memory. The shader also has a seemingly redundant variable `zero` that forces the compiler not to optimize away the for loop. The experiment will look like this: {F1754670481}
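The diff implements this probe as a compute shader driven from GPUInfo. Purely as an illustration of the technique, here is a minimal single-threaded CPU analogue; every name, buffer size, and constant below is hypothetical rather than taken from the diff:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// CPU analogue of the shader's inner loop: each iteration reads two values
// separated by `stride` bytes. While both reads land in the same cache line
// the loop is cheap; once the stride exceeds the line size, every iteration
// pays for two line fetches and the measured time jumps.
int main() {
  constexpr size_t kBufSize = 64 * 1024 * 1024; // hypothetical buffer size
  constexpr int kNiter = 1 << 20;               // stand-in for the calibrated NITER
  std::vector<uint8_t> buf(kBufSize, 1);

  for (size_t stride = 4; stride <= 256; stride *= 2) {
    volatile uint8_t zero = 0; // mirrors the shader's `zero`: defeats loop optimization
    uint32_t sum = 0;
    size_t i = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int n = 0; n < kNiter; ++n) {
      // Two reads `stride` bytes apart; reading `zero` every iteration keeps
      // the compiler from folding the loop into a constant.
      sum += buf[i + zero] + buf[i + stride + zero];
      i = (i + 2 * stride) % (kBufSize - 2 * stride);
    }
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::steady_clock::now() - t0)
                  .count();
    std::printf("stride %4zu B: %lld us (sum=%u)\n",
                stride, static_cast<long long>(us), sum);
  }
  return 0;
}
```

On typical hardware the printed time stays roughly flat while both reads share a line and then jumps once the stride passes the line size; the shader version looks for the same signature across strides.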
Some useful concept definitions:
- NITER: The number of iterations needed for the lowest stride to run for 1000 microseconds; every experiment then runs this many times. This gives a common timing baseline and avoids timing errors (see the calibration sketch after this list).
- PITCH: The separation in bytes between cache lines that ensures all concurrent groups are in use, so that a fetch from two different cache lines reliably shows up as a latency increase.
- STRIDE: The separation between the two fetched values. The cache line size is obtained experimentally: increasing the stride until it exceeds the cache line size shows a latency increase, giving us the result we are looking for.
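As one way to picture the NITER calibration, here is a sketch under the same assumptions as the CPU analogue above; the helper `time_reads_us` and all constants are hypothetical, not from the diff:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper matching the loop from the sketch above: times `niter`
// paired reads at a given stride and returns the elapsed microseconds.
static int64_t time_reads_us(const std::vector<uint8_t>& buf,
                             size_t stride, int niter) {
  volatile uint8_t zero = 0;
  uint32_t sum = 0;
  size_t i = 0;
  auto t0 = std::chrono::steady_clock::now();
  for (int n = 0; n < niter; ++n) {
    sum += buf[i + zero] + buf[i + stride + zero];
    i = (i + 2 * stride) % (buf.size() - 2 * stride);
  }
  auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                std::chrono::steady_clock::now() - t0).count();
  if (sum == 0xFFFFFFFFu) std::puts(""); // use `sum` so the reads are kept
  return us;
}

int main() {
  constexpr int64_t kTargetUs = 1000; // the 1000 us baseline from the summary
  constexpr size_t kMinStride = 4;    // hypothetical lowest stride in the sweep
  std::vector<uint8_t> buf(64 * 1024 * 1024, 1);

  // Trial run at the lowest stride, then scale the iteration count so the
  // same workload takes roughly 1000 us; all strides reuse this NITER.
  int niter = 1000;
  int64_t trial_us = time_reads_us(buf, kMinStride, niter);
  if (trial_us > 0) {
    niter = static_cast<int>(niter * kTargetUs / trial_us);
  }
  std::printf("calibrated NITER = %d\n", niter);
  return 0;
}
```

With NITER fixed, the per-stride timings are directly comparable, and the stride at which latency first jumps is reported as the buffer cache line size.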
Differential Revision: D59649561