[torchbench][accuracy] demucs accuracy check failed #459

Open
alexbaden opened this issue Feb 6, 2024 · 4 comments

@alexbaden
Contributor

» benchmarks/dynamo/torchbench.py --float32 -dxpu -n10 --no-skip --dashboard --training --inductor --accuracy --output /tmp/torchbench.csv --filter demucs

loading model: 0it [00:05, ?it/s]
xpu  train demucs                             
WARNING:common:fp64 golden ref were not generated for demucs. Setting accuracy check to cosine
/localdisk/abaden/Projects/envs/triton-benchmark-env/lib/python3.10/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
/localdisk/abaden/Projects/envs/triton-benchmark-env/lib/python3.10/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: _ZNK5torch8autograd4Node4nameEv
skipping cudagraphs for unknown reason
skipping cudagraphs for unknown reason
skipping cudagraphs for unknown reason
skipping cudagraphs for unknown reason
[2024-02-05 21:43:53,107] torch._dynamo.utils: [WARNING] Similarity score=0.00015774701023474336
fail_accuracy
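
For context on the warning above: when the fp64 golden reference cannot be generated for a model, the harness falls back to comparing the eager and compiled outputs by cosine similarity instead of element-wise closeness. A minimal sketch of that kind of check (the helper name and threshold here are illustrative, not the actual torchbench code):

import torch

def cosine_check(ref: torch.Tensor, res: torch.Tensor, threshold: float = 0.99) -> bool:
    # Compare the direction of the flattened outputs rather than exact values.
    ref, res = ref.flatten().double(), res.flatten().double()
    score = torch.nn.functional.cosine_similarity(ref, res, dim=0).item()
    return score >= threshold

# A similarity score around 0.00016, as in the warning above, means the eager
# and inductor outputs are essentially uncorrelated, not just slightly off.
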
@whitneywhtsang
Contributor

Also fails with v2.1.

@vlad-penkin vlad-penkin changed the title [torchbench] demucs accuracy check failed [torchbench][accuracy] demucs accuracy check failed Feb 9, 2024
@vlad-penkin vlad-penkin added this to the E2E pass rate milestone Feb 9, 2024
@ienkovich
Contributor

ienkovich commented Feb 13, 2024

demucs fails only in training, and the failure comes from the use of random numbers during training. Training passes on CPU but fails on XPU in both eager and inductor modes. It looks like this happens because the RNG state is not reset properly for XPU between model runs.

There are at least two places where torch.manual_seed is replaced with an implementation that is not XPU-enabled:
https://github.com/weishi-deng/benchmark/blob/main/torchbenchmark/util/env_check.py#L133
https://github.com/weishi-deng/benchmark/blob/main/userbenchmark/dynamo/dynamobench/common.py#L329
Fixing both places lets demucs pass on XPU in eager mode.

Here is the patch I use:

diff --git a/torchbenchmark/util/env_check.py b/torchbenchmark/util/env_check.py
index 956fdb4f..bef4924b 100644
--- a/torchbenchmark/util/env_check.py
+++ b/torchbenchmark/util/env_check.py
@@ -125,6 +125,8 @@ def set_random_seed():

         if not torch.cuda._is_in_bad_fork():
             torch.cuda.manual_seed_all(seed)
+        if hasattr(torch, 'xpu') and not torch.xpu._is_in_bad_fork():
+            torch.xpu.manual_seed_all(seed)
         return default_generator.manual_seed(seed)

     torch.manual_seed(MAIN_RANDOM_SEED)
diff --git a/userbenchmark/dynamo/dynamobench/common.py b/userbenchmark/dynamo/dynamobench/common.py
index 831dfe06..1bf14e96 100644
--- a/userbenchmark/dynamo/dynamobench/common.py
+++ b/userbenchmark/dynamo/dynamobench/common.py
@@ -320,10 +320,17 @@ def patch_torch_manual_seed():
         from torch._C import default_generator

         seed = 1337
-        import torch.cuda

-        if not torch.cuda._is_in_bad_fork():
-            torch.cuda.manual_seed_all(seed)
+        try:
+            import intel_extension_for_pytorch
+
+            if torch.xpu.is_available() and not torch.xpu._is_in_bad_fork():
+                torch.xpu.manual_seed_all(seed)
+        except:
+            import torch.cuda
+
+            if torch.cuda.is_available() and not torch.cuda._is_in_bad_fork():
+                torch.cuda.manual_seed_all(seed)
         return default_generator.manual_seed(seed)

     torch.manual_seed = deterministic_torch_manual_seed
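
A quick way to check the effect outside the benchmark harness (this assumes an XPU-enabled PyTorch build; the snippet is only an illustration, not part of the patch):

import torch

def reseed_and_draw(seed: int) -> torch.Tensor:
    # Mirror what the patched helpers do: seed the default generator and,
    # when XPU is present, the XPU device generators as well.
    torch.manual_seed(seed)
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.manual_seed_all(seed)
    return torch.rand(4, device="xpu")

# With the XPU generators reseeded, two draws with the same seed match exactly.
# The unpatched helpers only reseed the CPU/CUDA generators, so the XPU
# generator keeps advancing and repeated model runs see different numbers.
assert torch.equal(reseed_and_draw(1337), reseed_and_draw(1337))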

@ienkovich
Contributor

When we run the benchmark with Inductor, random number generation goes through a Triton kernel, and the seed used by that kernel is generated with an aten.randint operation. So I believe eager mode and Inductor will always produce different random number sequences, because they use different seeds and different generation algorithms. It doesn't look like different backends have to be aligned in random number generation, so their results are simply incomparable.

Inductor-generated wrapper code where the seed is generated and stored in buf0, which is later passed to the Triton kernel:

def call(args):
    buf0 = empty_strided((1, ), (1, ), device='xpu', dtype=torch.int64)
    # Source Nodes: [], Original ATen: []
    aten.randint.low_out(-9223372036854775808, 9223372036854775807, [1], out=buf0)
    buf1 = empty_strided((4, 4, 1, 1), (4, 1, 1, 1), device='xpu', dtype=torch.int64)
    # Source Nodes: [], Original ATen: []
    stream0 = get_xpu_stream(0)
    triton_poi_fused_0.run(buf0, buf1, 0, 16, grid=grid(16), stream=stream0)
    return (buf1, )

Triton kernel where the seed (tmp0) is used for randint64:

def triton_(in_ptr0, out_ptr0, load_seed_offset, xnumel, XBLOCK : tl.constexpr):
    xnumel = 16
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + load_seed_offset)
    tmp1 = x0
    tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 44100)
    tl.store(out_ptr0 + (x0), tmp2, xmask)
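
The mismatch is easy to demonstrate outside demucs: compiling even a trivial random op yields a different sequence than eager under the same manual seed. A rough sketch (the device string assumes an XPU build; the same divergence shows up on other backends):

import torch

def sample():
    # The same kind of op the demucs training path hits: ints in [0, 44100).
    return torch.randint(0, 44100, (4, 4), device="xpu")

compiled_sample = torch.compile(sample)

torch.manual_seed(1337)
eager_out = sample()

torch.manual_seed(1337)
compiled_out = compiled_sample()

# Eager draws from the device generator directly, while Inductor draws a fresh
# seed via aten.randint and feeds it to the Triton RNG helper shown above, so
# the two sequences are generally unrelated even with the same manual seed.
print(torch.equal(eager_out, compiled_out))  # typically False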

@vlad-penkin
Contributor

This issue is still reproducible.

Env:

  • pytorch is built from source, top of the main trunk, commit_id - 9a8ab778d34bd24c5caceb340837483decc4c311
  • triton xpu is built from source, top of the main trunk, commit_id - fe93a00ffe438e9ba8c8392c0b051b1662c810de
  • benchmark is built from source, top of the main trunk, commit_id - d54ca9f80ead108c8797441681e219becaf963d8
  • torchaudio is built from source, top of the main trunk, commit_id - 1980f8af5bcd0bb2ce51965cf79d8d4c25dad8a0
  • torchvision is built from source, top of the main trunk, commit_id - 10239873229e527f8b7e7b3340c40ee38bb1cfc4
  • PyTorch Dependency Bundle 0.5.0
  • Latest Rolling Driver
