Use int64 in pdist kernel to handle batches >= 46342 #30583 #31593

Closed
ptrblck wants to merge 12 commits

Conversation

@ptrblck (Collaborator) commented Dec 24, 2019

Currently `torch.pdist` yields an illegal CUDA memory access for batch sizes >= 46342, as reported by @ssnl in #30583.
Thanks for the minimal code reproduction, btw! ;)

Reason for this bug:
The calculation of `i` in [`pdist_kernel_cuda_impl`](https://github.com/pytorch/pytorch/blob/46ad80c8395379be5ba17624fd5dbad8e7a8e8d2/aten/src/ATen/native/cuda/DistanceKernel.cu#L112) can overflow if a tensor with a batch size >= 46342 is passed to `torch.pdist`.

Detailed description:

* `result` is resized to `n * (n - 1) / 2 = 1073767311` ([line of code](https://github.com/pytorch/pytorch/blob/46ad80c8395379be5ba17624fd5dbad8e7a8e8d2/aten/src/ATen/native/Distance.cpp#L140))
* `grid` is initialized as `result.numel()` ([line of code](https://github.com/pytorch/pytorch/blob/46ad80c8395379be5ba17624fd5dbad8e7a8e8d2/aten/src/ATen/native/cuda/DistanceKernel.cu#L246))
* `k` is assigned the value of `blockIdx.x` as an `int32` ([line of code](https://github.com/pytorch/pytorch/blob/46ad80c8395379be5ba17624fd5dbad8e7a8e8d2/aten/src/ATen/native/cuda/DistanceKernel.cu#L108))
* `i` is calculated using `2 * k >= 2147534622` ([line of code](https://github.com/pytorch/pytorch/blob/46ad80c8395379be5ba17624fd5dbad8e7a8e8d2/aten/src/ATen/native/cuda/DistanceKernel.cu#L112)), which overflows, since `2147534622 > 2147483647 (int32_max)`.

Using `const int64_t k = blockIdx.x;` would solve the illegal memory access. The same approach already seems to be used in [`cdist_kernel_cuda_impl`](https://github.com/pytorch/pytorch/blob/46ad80c8395379be5ba17624fd5dbad8e7a8e8d2/aten/src/ATen/native/cuda/DistanceKernel.cu#L198-L201).
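
For illustration, here is a minimal Python sketch of the overflow, emulating 32-bit wraparound by hand; the index math is simplified and this is not the actual kernel code:

```python
# Minimal sketch of the indexing overflow described above (not the CUDA kernel).
INT32_MAX = 2**31 - 1

def wrap_int32(x):
    """Emulate two's-complement 32-bit integer wraparound."""
    return (x + 2**31) % 2**32 - 2**31

n = 46342                      # smallest failing batch size from #30583
num_pairs = n * (n - 1) // 2   # 1073767311 elements in the pdist result
k = num_pairs - 1              # largest linear index a block has to handle

print(2 * k)                   # 2147534620, already past INT32_MAX
print(wrap_int32(2 * k))       # negative after int32 wraparound -> bad index
print(2 * k <= INT32_MAX)      # False; with an int64 `k` the value stays valid
```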

However, we might expect a slowdown, so I've timed the current PyTorch master vs. this PR:
(tested with `x = torch.randn(x.size(0), 128)` on a V100)

| x.size(0) | int32 idx | int64 idx | slowdown |
|-----------|-----------|-----------|----------|
| 50000 | - | 4.4460 | - |
| 25000 | 1.02522 | 1.10869 | 7.53% |
| 12500 | 0.25182 | 0.27277 | 7.68% |
| 6250 | 0.06291 | 0.06817 | 7.72% |
| 3125 | 0.01573 | 0.01704 | 7.69% |
| 1562 | 0.00393 | 0.00426 | 7.75% |
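
For reference, a forward-only timing loop along the following lines produces numbers comparable to the table above. This is a sketch; the exact script used for the forward timings isn't shown in the thread, so the iteration count and the printed unit are assumptions:

```python
import time
import torch

# Sketch of a forward-only pdist timing loop (assumes a CUDA device is available).
nb_iters = 10
sizes = [50000, 25000, 12500, 6250, 3125, 1562]

for size in sizes:
    x = torch.randn(size, 128, device='cuda')
    for _ in range(nb_iters):        # warmup
        torch.pdist(x)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(nb_iters):        # timed region
        torch.pdist(x)
    torch.cuda.synchronize()
    print('size {}, time {:.5f}s/iter'.format(size, (time.time() - t0) / nb_iters))
```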

While checking the backward kernel, it seems I'm triggering another error at a certain size limit:

```python
x = torch.randn(1449, 1, device='cuda', requires_grad=True)
out = torch.pdist(x)
out.mean().backward()
> RuntimeError: CUDA error: invalid configuration argument
```

while `[<=1448, 1]` works.

I'll take another look at this issue. Let me know if the potential fix should go into this PR or if I should open a new issue.

CC @ngimel, @csarofeen

@kostmo (Member) commented Dec 24, 2019

💊 CircleCI build failures summary and remediations

As of commit 1fc79be:

  • 1/1 failures introduced in this PR

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakage:

See CircleCI build pytorch_xla_linux_xenial_py3_6_clang7_test (1/1)

Step: "Test" (full log | pattern match details)

```
Feb 09 04:08:57 caused by: Connection refused (os error 111)
Feb 09 04:08:57 +++ eval 'extract_trap_cmd '
Feb 09 04:08:57 ++++ extract_trap_cmd
Feb 09 04:08:57 ++++ printf '%s\n' ''
Feb 09 04:08:57 +++ printf '%s\n' cleanup
Feb 09 04:08:57 ++ trap -- '
Feb 09 04:08:57 cleanup' EXIT
Feb 09 04:08:57 ++ which sccache
Feb 09 04:08:57 ++ sccache --stop-server
Feb 09 04:08:57 Stopping sccache server...
Feb 09 04:08:57 error: couldn't connect to server
Feb 09 04:08:57 caused by: Connection refused (os error 111)
Feb 09 04:08:57 ++ true
Feb 09 04:08:57 ++ rm /var/lib/jenkins/sccache_error.log
Feb 09 04:08:57 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log
Feb 09 04:08:57 ++ SCCACHE_IDLE_TIMEOUT=1200
Feb 09 04:08:57 ++ RUST_LOG=sccache::server=error
Feb 09 04:08:57 ++ sccache --start-server
Feb 09 04:08:57 Starting sccache server...
Feb 09 04:08:58 ++ sccache --zero-stats
Feb 09 04:08:58 Compile requests                 0
Feb 09 04:08:58 Compile requests executed        0
```

This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 52 times.

@ptrblck (Collaborator, Author) commented Dec 24, 2019

It seems the backward error comes from a wrong CUDA launch config in this line of code: `grid_y` exceeds 65535 (for an input tensor of `[1449, 1]` it comes out to 65568, which is invalid according to the CUDA programming guide).
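
For illustration, a small Python sketch of the launch-config arithmetic; the `block_y = 16` value is an assumption, chosen because it reproduces the 65568 figure quoted above, and the real block size comes from the kernel's launch code:

```python
import math

CUDA_MAX_GRID_Y = 65535  # max y-dimension of a grid per the CUDA programming guide

def backward_grid_y(n, block_y=16):
    """Sketch of the backward grid-y computation; block_y=16 is an assumption
    chosen to reproduce the grid_y = 65568 reported for an input of [1449, 1]."""
    num_pairs = n * (n - 1) // 2          # numel of the pdist result
    return math.ceil(num_pairs / block_y)

print(backward_grid_y(1448), backward_grid_y(1448) <= CUDA_MAX_GRID_Y)  # 65477 True
print(backward_grid_y(1449), backward_grid_y(1449) <= CUDA_MAX_GRID_Y)  # 65568 False
```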

A similar error is raised for cdist (but at a different size; I need to triage that bug as well).

I'll open a new issue and try to fix the launch configs there.

EDIT: might also be related to #24345

EDIT2: PR for cdist #31167 which might be also applicable for pdist.

@ngimel (Collaborator) commented Dec 24, 2019

@ptrblck 8% perf regression is acceptable. Please fix backward in this PR also. We can leave cdist separate for now.

@ngimel (Collaborator) commented Jan 7, 2020

And add tests please. I don't understand how just swapping the grid and block sizes would work without corresponding changes in the kernel itself.

@ptrblck (Collaborator, Author) commented Jan 7, 2020

I've added tests for the failing use cases and kept the sizes "reasonably" small, but let me know if I should expand the test cases.

Benchmarking for the backward pass on V100-SXM2 32GB (time in ms/iter):

Using `input = torch.randn(size, 128, device='cuda')`:

| x.size(0) | before PR | after PR |
|-----------|-----------|----------|
| 5600 | - | 78.8309 |
| 2800 | - | 19.9385 |
| 1400 | 6.0119 | 5.0028 |
| 700 | 1.5210 | 1.2706 |
| 350 | 0.4104 | 0.3429 |
| 175 | 0.1673 | 0.1648 |

Using `input = torch.randn(size, 1, device='cuda')`:

| x.size(0) | before PR | after PR |
|-----------|-----------|----------|
| 50000 | - | 4541.2841 |
| 25000 | - | 1133.9128 |
| 12500 | - | 283.3532 |
| 6250 | - | 70.8232 |
| 3125 | - | 17.7395 |
| 1562 | - | 4.4551 |
| 781 | 1.3270 | 1.1294 |
| 390 | 0.3595 | 0.3087 |
| 195 | 0.1651 | 0.2168 |

Code used for benchmarking:

```python
import time
import torch

nb_iters = 10
sizes = [int(50000 / 2**i) for i in range(10)]

for size in sizes:
    x = torch.randn(size, 1, device='cuda', requires_grad=True)
    # warmup
    for _ in range(nb_iters):
        out = torch.pdist(x)
        out.mean().backward()
    # print(torch.cuda.memory_allocated() / 1024**3)

    # timed region
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(nb_iters):
        out = torch.pdist(x)
        out.mean().backward()
    torch.cuda.synchronize()
    t1 = time.time()
    print('size {}, time {:.4f}ms/iter'.format(size, 1000 * (t1 - t0) / nb_iters))
```

@ngimel (Collaborator) commented Jan 7, 2020

1. The test failures are real.
2. The tests are smoke tests and don't test correctness.
3. This is concerning, because I don't see how just flipping the grid dimensions without changing the actual kernels can work.

@ptrblck (Collaborator, Author) commented Jan 7, 2020

Sorry for the confusion, but the kernel changes weren't included in the commit. 😕
I'll add correctness tests and commit the kernel changes.

@ptrblck (Collaborator, Author) commented Jan 8, 2020

The forward test for `[50000, 1]` uses approx. 23 GB when comparing against `brute_pdist`; just the smoke test without the comparison uses ~4.66 GB.
Should we fall back to the smoke test for this shape? I could decorate this test with `LARGE_TENSOR` and try to run it on our CI.

I've added the gradient check for the other, smaller shapes as well.
Let me know if I should remove them and keep only the backward check for `[1500, 1]`.
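
For context, a brute-force comparison along the following lines (a rough sketch, not the actual `brute_pdist` helper from the test suite) has to materialize the full `n x n` distance matrix, which is where the memory blow-up for `[50000, 1]` comes from:

```python
import torch

def brute_pdist_reference(x, p=2.0):
    """Naive reference for torch.pdist: build the full (n, n) distance matrix
    and keep the upper triangle. A sketch, not the test-suite brute_pdist."""
    n = x.size(0)
    dist = torch.cdist(x, x, p=p)                                   # O(n^2) memory
    row, col = torch.triu_indices(n, n, offset=1, device=x.device)  # above-diagonal indices
    return dist[row, col]                                           # same ordering as torch.pdist

device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(1000, 1, device=device)
assert torch.allclose(torch.pdist(x), brute_pdist_reference(x), atol=1e-4)
```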

@ptrblck ptrblck changed the title Use int64 in pdist kernel to handle batches >= 46342 #30583 [WIP] Use int64 in pdist kernel to handle batches >= 46342 #30583 Jan 8, 2020
@ptrblck ptrblck changed the title [WIP] Use int64 in pdist kernel to handle batches >= 46342 #30583 Use int64 in pdist kernel to handle batches >= 46342 #30583 Jan 8, 2020
@zou3519 zou3519 requested a review from ngimel January 9, 2020 19:10
@zou3519 zou3519 added the triaged label Jan 9, 2020
@ptrblck (Collaborator, Author) commented Jan 24, 2020

@pytorchbot retest this please

@ngimel (Collaborator) commented Jan 24, 2020

At least the ROCm failure is real; I have not looked at the other ones.

@ptrblck (Collaborator, Author) commented Jan 27, 2020

@pytorchbot retest this please

@ngimel (Collaborator) commented Feb 3, 2020

@ptrblck can you please split the pdist test into two: one testing forward, and another testing backward that would be completely skipped on ROCm using the @skipIfRocm decorator? We are trying to deprecate the TEST_WITH_ROCM flag.

@ptrblck (Collaborator, Author) commented Feb 4, 2020

@ngimel Sure, I'll split the tests and skip the backward for ROCm.
However, ROCm now also seems to fail in the forward pass for a tensor of `[50000, 1]` (the initial commit to fix the forward-pass size limitation by using `const int64_t k = blockIdx.x;`) with:

```
00:38:18 ======================================================================
00:38:18 FAIL: test_pdist_norm_cuda (__main__.TestTorchDeviceTypeCUDA)
00:38:18 ----------------------------------------------------------------------
00:38:18 Traceback (most recent call last):
00:38:18   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 681, in wrapper
00:38:18     method(*args, **kwargs)
00:38:18   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 180, in instantiated_test
00:38:18     return test(self, device_arg)
00:38:18   File "test_torch.py", line 11002, in test_pdist_norm
00:38:18     self.assertTrue(torch.allclose(expected_cpu, actual_gpu.cpu()))
00:38:18 AssertionError: False is not true
```

This test compares the GPU result for p=2 against the CPU result.

Should I skip this test for ROCm as well, or wait for a review?

@ngimel (Collaborator) commented Feb 6, 2020

@ptrblck please skip the forward test also and file an issue for ROCm

@ngimel (Collaborator) left a review comment:


The things I'd like to see in the tests:

1. No TEST_WITH_ROCM, only decorators.
2. A separate test for the large size you are adding.
3. Disable the ROCm tests as needed (as long as you are not breaking anything, and those are tests that you are adding), but file an issue for ROCm detailing the failures.

@ptrblck (Collaborator, Author) commented Feb 9, 2020

I've removed the `TEST_WITH_ROCM` usage and added the decorators:
`test_pdist_norm_forward` runs on all devices, while `test_pdist_norm_backward` and `test_pdist_norm_large` are skipped on ROCm.
I've also moved `pdist_single` to `common_utils` (where `brute_pdist` is also located) to avoid code duplication.

I'll create a new issue with the information about the failing ROCm tests.
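
Roughly, the split looks like the sketch below. It is illustrative only: the shapes, assertions, and class name are made up here, the `skipIfRocm` import path is assumed to be `torch.testing._internal.common_utils`, and the tests that actually landed in `test_torch.py` differ:

```python
import unittest
import torch
from torch.testing._internal.common_utils import skipIfRocm  # decorators instead of TEST_WITH_ROCM

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

class TestPdistSplitSketch(unittest.TestCase):
    def test_pdist_norm_forward(self):
        # forward smoke/shape check, runs on all devices
        x = torch.randn(1000, 2, device=DEVICE)
        self.assertEqual(torch.pdist(x).numel(), 1000 * 999 // 2)

    @skipIfRocm  # backward currently fails on ROCm; tracked in a separate issue
    def test_pdist_norm_backward(self):
        x = torch.randn(1500, 1, device=DEVICE, requires_grad=True)
        torch.pdist(x).mean().backward()
        self.assertIsNotNone(x.grad)

    @skipIfRocm  # the large [50000, 1] forward also fails on ROCm (see above); GPU-sized workload
    def test_pdist_norm_large(self):
        x = torch.randn(50000, 1, device=DEVICE)
        self.assertEqual(torch.pdist(x).numel(), 50000 * 49999 // 2)

if __name__ == '__main__':
    unittest.main()
```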

@facebook-github-bot (Contributor) left a comment:


@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:
@ngimel merged this pull request in a64d0ff.

ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
Use int64 in pdist kernel to handle batches >= 46342 (pytorch#31593)

Pull Request resolved: pytorch#31593

Differential Revision: D19825571

Pulled By: ngimel

fbshipit-source-id: ace9ccab49f3cf0ce894cdb6daef0795e2e8ec03
Labels: Merged, open source, triaged
Projects: None yet
7 participants