API Validation failures #467

Open

Description

I think we should run all API Validation tests and allow them to fail, instead of running only a subset, so that we are aware of the failures. Once the failures have been dealt with, we can make the full run mandatory.

Currently failing tests:

  • gpuarrays/linalg/mul!/vector-matrix
  • mps/linalg (heisenbug, see local example)
  • mps/copy

Local example:
(Metal) pkg> test
     Testing Metal
...
     Testing Running tests...
2024-10-18 16:29:58.279 julia[36961:444153] Metal API Validation Enabled
2024-10-18 16:29:58.279 julia[36961:444153] Metal GPU Validation Enabled
┌ Info: System information:
│ macOS 15.0.1, Darwin 24.0.0
│ 
│ Toolchain:
│ - Julia: 1.11.1
│ - LLVM: 16.0.6
│ 
│ Julia packages: 
│ - Metal.jl: 1.4.0
│ - GPUArrays: 11.0.0
│ - GPUCompiler: 1.0.0
│ - KernelAbstractions: 0.9.28
│ - ObjectiveC: 3.1.0
│ - LLVM: 9.1.2
│ - LLVMDowngrader_jll: 0.3.0+1
│ 
│ Environment:
│ - MTL_SHADER_VALIDATION: 1
│ - MTL_DEBUG_LAYER: 1
│ 
│ 1 device:
└ - Apple M2 Max (192.000 KiB allocated)
[ Info: Running 8 tests in parallel. If this is too many, specify the `--jobs` argument to the tests, or set the JULIA_CPU_THREADS environment variable.
      From worker 7:    2024-10-18 16:30:06.803 julia[36969:444278] Metal API Validation Enabled
      From worker 7:    2024-10-18 16:30:06.803 julia[36969:444278] Metal GPU Validation Enabled
      From worker 4:    2024-10-18 16:30:06.814 julia[36966:444275] Metal API Validation Enabled
      From worker 4:    2024-10-18 16:30:06.814 julia[36966:444275] Metal GPU Validation Enabled
      From worker 9:    2024-10-18 16:30:06.817 julia[36971:444280] Metal API Validation Enabled
      From worker 9:    2024-10-18 16:30:06.817 julia[36971:444280] Metal GPU Validation Enabled
      From worker 3:    2024-10-18 16:30:06.828 julia[36965:444274] Metal API Validation Enabled
      From worker 3:    2024-10-18 16:30:06.829 julia[36965:444274] Metal GPU Validation Enabled
      From worker 8:    2024-10-18 16:30:06.831 julia[36970:444279] Metal API Validation Enabled
      From worker 8:    2024-10-18 16:30:06.831 julia[36970:444279] Metal GPU Validation Enabled
      From worker 6:    2024-10-18 16:30:06.842 julia[36968:444277] Metal API Validation Enabled
      From worker 6:    2024-10-18 16:30:06.843 julia[36968:444277] Metal GPU Validation Enabled
      From worker 2:    2024-10-18 16:30:06.843 julia[36964:444270] Metal API Validation Enabled
      From worker 2:    2024-10-18 16:30:06.843 julia[36964:444270] Metal GPU Validation Enabled
      From worker 5:    2024-10-18 16:30:06.863 julia[36967:444276] Metal API Validation Enabled
      From worker 5:    2024-10-18 16:30:06.864 julia[36967:444276] Metal GPU Validation Enabled
                                                  |          | ---------------- CPU ---------------- |
Test                                     (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
metallib                                      (8) |     0.68 |   0.01 |  1.9 |     209.75 |   573.78 |
pool                                          (9) |     1.12 |   0.03 |  2.5 |     320.76 |   596.08 |
      From worker 10:   2024-10-18 16:30:14.310 julia[36977:444462] Metal API Validation Enabled
      From worker 10:   2024-10-18 16:30:14.311 julia[36977:444462] Metal GPU Validation Enabled
      From worker 8:    Starting recording with the Blank template and GPU, Time Profiler, Metal Application, Metal GPU Counters, Metal Resource Events, os_signpost Instruments. Attaching to: julia (36970). 
      From worker 8:    Ctrl-C to stop the recording
      From worker 8:    Stopping recording...
metal                                         (7) |     4.18 |   0.12 |  2.8 |     528.20 |   712.36 |
      From worker 7:    ┌ Warning: Skipping script tests
      From worker 7:    └ @ Main ~/.julia/dev/Metal/test/scripts.jl:9
scripts                                       (7) |     0.86 |   0.00 |  0.0 |      76.59 |   716.12 |
      From worker 8:    Recording completed. Saving output file...
      From worker 8:    Output file saved as: julia_1.trace
      From worker 8:    [ Info: System trace saved to /private/var/folders/4g/lnkpkf3s4rxd_wbl8vwnqs4r0000gn/T/jl_6ZIMtu/julia_1.trace; open the resulting trace in Instruments
profiling                                     (8) |     6.73 |   0.00 |  0.0 |      99.23 |   593.14 |
      From worker 10:   ┌ Warning: Skipping capturing tests; capturing is not supported with Metal Shader Validation enabled
      From worker 10:   └ @ Main ~/.julia/dev/Metal/test/capturing.jl:4
capturing                                    (10) |     0.82 |   0.00 |  0.0 |      85.42 |   560.77 |
      From worker 11:   2024-10-18 16:30:24.248 julia[37027:445348] Metal API Validation Enabled
      From worker 11:   2024-10-18 16:30:24.248 julia[37027:445348] Metal GPU Validation Enabled
execution                                     (5) |    16.66 |   0.25 |  1.5 |    1773.28 |   793.55 |
mps/matrix                                    (5) |     0.37 |   0.00 |  0.0 |      52.49 |   798.92 |
mps/size                                      (5) |     0.04 |   0.00 |  0.0 |       1.41 |   799.62 |
mps/vector                                    (5) |     0.14 |   0.00 |  0.0 |      19.17 |   800.42 |
examples                                      (4) |    25.74 |   0.64 |  2.5 |    2717.08 |  2026.69 |
gpuarrays/indexing scalar                     (5) |     9.58 |   0.11 |  1.2 |    1401.12 |   881.42 |
kernelabstractions                            (6) |    30.00 |   0.56 |  1.9 |    3955.16 |  1033.52 |
random                                        (9) |    31.51 |   0.49 |  1.5 |    3735.46 |   990.28 |
device/intrinsics                             (7) |    36.95 |   0.47 |  1.3 |    4235.62 |  1026.02 |
      From worker 11:
      From worker 11:   [37027] signal 10 (1): Bus error: 10
      From worker 11:   in expression starting at /Users/christian/.julia/dev/Metal/test/mps/linalg.jl:3
      From worker 11:   objc_msgSend at /usr/lib/libobjc.A.dylib (unknown line)
      From worker 11:   _ZN24resolvedSharedPacketDataI23GPUDebugBadAccessPacketEC2ERKS0_15MTLFunctionTypeP24MTLGPUDebugCommandBufferP17MTLGPUDebugGPULog at /System/Library/PrivateFrameworks/MetalTools.framework/Versions/A/MetalTools (unknown line)
      From worker 11:   Allocations: 85721828 (Pool: 85719313; Big: 2515); GC: 47
mps/linalg                                   (11) |         failed at 2024-10-18T16:30:59.210
Worker 11 terminated.
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#832")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:970
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:978
 [3] unsafe_read
   @ ./io.jl:891 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:890
 [5] read!
   @ ./io.jl:895 [inlined]
 [6] deserialize_hdr_raw
   @ ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:172
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:133
 [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:121
      From worker 12:   2024-10-18 16:31:03.011 julia[37573:446632] Metal API Validation Enabled
      From worker 12:   2024-10-18 16:31:03.011 julia[37573:446632] Metal GPU Validation Enabled
gpuarrays/math/power                          (6) |    26.93 |   0.53 |  2.0 |    4850.07 |  1274.00 |
array                                         (2) |    64.57 |   1.23 |  1.9 |    8220.88 |  1769.12 |
gpuarrays/indexing find                       (7) |    23.17 |   0.57 |  2.4 |    5480.38 |  1208.73 |
gpuarrays/linalg/mul!/vector-matrix           (9) |         failed at 2024-10-18T16:31:20.021
gpuarrays/reductions/any all count            (6) |    11.26 |   0.13 |  1.1 |    1729.80 |  1403.30 |
      From worker 13:   2024-10-18 16:31:24.098 julia[37969:447437] Metal API Validation Enabled
      From worker 13:   2024-10-18 16:31:24.098 julia[37969:447437] Metal GPU Validation Enabled
gpuarrays/uniformscaling                      (7) |     7.15 |   0.04 |  0.6 |     635.54 |  1348.91 |
gpuarrays/math/intrinsics                     (7) |     3.70 |   0.03 |  0.7 |     374.92 |  1410.30 |
mps/copy                                      (8) |         failed at 2024-10-18T16:31:34.072
      From worker 14:   2024-10-18 16:31:37.928 julia[38219:447992] Metal API Validation Enabled
      From worker 14:   2024-10-18 16:31:37.928 julia[38219:447992] Metal GPU Validation Enabled
gpuarrays/indexing multidimensional          (12) |    52.60 |   0.74 |  1.4 |    6553.91 |  1061.39 |
gpuarrays/reductions/reducedim!               (4) |    85.07 |   1.29 |  1.5 |   11756.37 |  2308.91 |
gpuarrays/linalg/norm                         (7) |    38.33 |   0.49 |  1.3 |    5759.17 |  1591.80 |
gpuarrays/vectors                             (7) |     0.17 |   0.00 |  0.0 |      22.95 |  1593.03 |
gpuarrays/linalg/mul!/matrix-matrix           (6) |    56.91 |   0.43 |  0.8 |    5222.27 |  1547.66 |
gpuarrays/random                              (7) |    12.45 |   0.08 |  0.6 |    1200.88 |  1678.14 |
gpuarrays/linalg                              (5) |   104.86 |   1.66 |  1.6 |   14532.96 |  1559.11 |
gpuarrays/reductions/mapreducedim!_large     (13) |    57.07 |   1.34 |  2.3 |    8654.05 |  1452.14 |
gpuarrays/constructors                        (4) |    22.09 |   0.19 |  0.9 |    2061.44 |  2376.53 |
gpuarrays/statistics                         (14) |    48.62 |   0.70 |  1.4 |    5975.28 |   955.05 |
gpuarrays/base                                (6) |    25.21 |   0.58 |  2.3 |    4725.99 |  1839.33 |
gpuarrays/reductions/== isequal               (7) |    43.42 |   0.54 |  1.2 |    6198.30 |  2041.39 |
gpuarrays/reductions/reduce                   (4) |    61.15 |   1.22 |  2.0 |   11113.19 |  2376.53 |
gpuarrays/reductions/minimum maximum extrema  (2) |   140.04 |   2.39 |  1.7 |   21722.48 |  2168.75 |
gpuarrays/reductions/mapreduce               (12) |   114.96 |   1.89 |  1.6 |   17942.78 |  1959.66 |
gpuarrays/reductions/mapreducedim!           (13) |   104.12 |   1.57 |  1.5 |   14456.61 |  2160.31 |
gpuarrays/reductions/sum prod                (14) |   109.30 |   1.71 |  1.6 |   16192.86 |  2012.47 |
gpuarrays/broadcasting                        (5) |   152.22 |   2.05 |  1.3 |   19808.61 |  2611.33 |
Testing finished in 4 minutes, 47 seconds, 973 milliseconds
mps/linalg: Error During Test at none:1
  Got exception outside of a @test
  ProcessExitedException(11)
Worker 9 failed running test gpuarrays/linalg/mul!/vector-matrix:
Some tests did not pass: 139 passed, 1 failed, 0 errored, 0 broken.
gpuarrays/linalg/mul!/vector-matrix: Test Failed at /Users/christian/.julia/dev/GPUArrays/test/testsuite/linalg.jl:315
  Expression: compare(*, AT, f(A), x)

Stacktrace:
 [1] backtrace()
   @ Base ./error.jl:114
 [2] record(ts::Test.DefaultTestSet, t::Union{Test.Error, Test.Fail}; print_result::Bool)
   @ Test ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/Test/src/Test.jl:1107
 [3] record(ts::Test.DefaultTestSet, t::Union{Test.Error, Test.Fail})
   @ Test ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/Test/src/Test.jl:1100
 [4] top-level scope
   @ ~/.julia/dev/Metal/test/runtests.jl:379
 [5] include(fname::String)
   @ Main ./sysimg.jl:38
 [6] top-level scope
   @ none:6
 [7] eval
   @ ./boot.jl:430 [inlined]
 [8] exec_options(opts::Base.JLOptions)
   @ Base ./client.jl:296
 [9] _start()
   @ Base ./client.jl:531
Worker 8 failed running test mps/copy:
Some tests did not pass: 143 passed, 1 failed, 0 errored, 64 broken.
mps/copy: Test Failed at /Users/christian/.julia/dev/Metal/test/mps/copy.jl:46
  Expression: dstMat == srcMat
   Evaluated: Int8[-7 -37 … -28 -9; -23 -38 … -89 -106; … ; 77 12 … 71 116; -92 -6 … -103 -51] == Int8[-7 -37 … -28 -9; -23 -38 … -89 -106; … ; 77 12 … 71 116; -92 -6 … -103 -51]

Stacktrace:
 [1] record(ts::Test.DefaultTestSet, t::Union{Test.Error, Test.Fail}; print_result::Bool)
   @ Test ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/Test/src/Test.jl:1107
 [2] record(ts::Test.DefaultTestSet, t::Union{Test.Error, Test.Fail})
   @ Test ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/Test/src/Test.jl:1100
 [3] top-level scope
   @ ~/.julia/dev/Metal/test/runtests.jl:379
 [4] include(fname::String)
   @ Main ./sysimg.jl:38
 [5] top-level scope
   @ none:6
 [6] eval
   @ ./boot.jl:430 [inlined]
 [7] exec_options(opts::Base.JLOptions)
   @ Base ./client.jl:296
 [8] _start()
   @ Base ./client.jl:531

Test Summary:                                  | Pass  Fail  Error  Broken  Total  Time
  Overall                                      | 9688     2      1     104   9795      
    metallib                                   |   25                          25      
    pool                                       |    5                           5      
    metal                                      |  128                         128      
    scripts                                    |                                0      
    profiling                                  |    1                           1      
    capturing                                  |                                0      
    execution                                  |   37                          37      
    mps/matrix                                 |   76                          76      
    mps/size                                   |    9                           9      
    mps/vector                                 |   34                          34      
    examples                                   |    4                           4      
    gpuarrays/indexing scalar                  |  399                         399      
    kernelabstractions                         | 2179                    8   2187      
    random                                     |  818                         818      
    device/intrinsics                          |  129                         129      
    mps/linalg                                 |                 1              1      
    gpuarrays/math/power                       |   60                          60      
    array                                      |  409                   32    441      
    gpuarrays/indexing find                    |   45                          45      
    gpuarrays/linalg/mul!/vector-matrix        |  139     1                   140      
    gpuarrays/reductions/any all count         |  101                         101      
    gpuarrays/uniformscaling                   |   56                          56      
    gpuarrays/math/intrinsics                  |   10                          10      
    mps/copy                                   |  143     1             64    208      
    gpuarrays/indexing multidimensional        |   89                          89      
    gpuarrays/reductions/reducedim!            |  160                         160      
    gpuarrays/linalg/norm                      |  264                         264      
    gpuarrays/vectors                          |   10                          10      
    gpuarrays/linalg/mul!/matrix-matrix        |  360                         360      
    gpuarrays/random                           |   52                          52      
    gpuarrays/linalg                           |  397                         397      
    gpuarrays/reductions/mapreducedim!_large   |   40                          40      
    gpuarrays/constructors                     |  832                         832      
    gpuarrays/statistics                       |   52                          52      
    gpuarrays/base                             |   95                          95      
    gpuarrays/reductions/== isequal            |  230                         230      
    gpuarrays/reductions/reduce                |  220                         220      
    gpuarrays/reductions/minimum maximum extrema |  555                         555      
    gpuarrays/reductions/mapreduce             |  330                         330      
    gpuarrays/reductions/mapreducedim!         |  260                         260      
    gpuarrays/reductions/sum prod              |  636                         636      
    gpuarrays/broadcasting                     |  299                         299      
    FAILURE

Error in testset mps/linalg:
Error During Test at none:1
  Got exception outside of a @test
  ProcessExitedException(11)
Error in testset gpuarrays/linalg/mul!/vector-matrix:
Test Failed at /Users/christian/.julia/dev/GPUArrays/test/testsuite/linalg.jl:315
  Expression: compare(*, AT, f(A), x)

Error in testset mps/copy:
Test Failed at /Users/christian/.julia/dev/Metal/test/mps/copy.jl:46
  Expression: dstMat == srcMat
   Evaluated: Int8[-7 -37 … -28 -9; -23 -38 … -89 -106; … ; 77 12 … 71 116; -92 -6 … -103 -51] == Int8[-7 -37 … -28 -9; -23 -38 … -89 -106; … ; 77 12 … 71 116; -92 -6 … -103 -51]

ERROR: LoadError: Test run finished with errors
in expression starting at /Users/christian/.julia/dev/Metal/test/runtests.jl:410
ERROR: Package Metal errored during testing
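
For anyone who wants to reproduce a run like the one above: the Environment block of the log shows `MTL_DEBUG_LAYER` and `MTL_SHADER_VALIDATION` both set to `1`. A minimal sketch of setting them from Julia is below. The helper name `with_metal_validation` is hypothetical and not part of Metal.jl. It also assumes that the test process started by `Pkg.test` inherits the parent environment, which I have not verified against the exact setup used here.

```julia
# Hypothetical helper (not part of Metal.jl): run `f` with the two validation
# variables from the log's Environment block set to "1". `withenv` restores
# the previous values of both variables once `f` returns.
function with_metal_validation(f)
    withenv(f, "MTL_DEBUG_LAYER" => "1", "MTL_SHADER_VALIDATION" => "1")
end

# Usage sketch, assuming a development checkout of Metal.jl:
#   using Pkg
#   with_metal_validation() do
#       Pkg.test("Metal")
#   end
```

Setting the variables in the shell before starting Julia should work equally well. The wrapper only keeps them from leaking into the rest of the session.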