
Conversation

yuehhua
Contributor

@yuehhua yuehhua commented Aug 27, 2021

Closes #1113

@maleadt maleadt added the cuda libraries Stuff about CUDA library wrappers. label Aug 27, 2021

codecov bot commented Aug 27, 2021

Codecov Report

Merging #1120 (8af2171) into tb/sanitize (5ce89c3) will increase coverage by 11.78%.
The diff coverage is 85.00%.

❗ Current head 8af2171 differs from pull request most recent head d9d9cb0. Consider uploading reports for the commit d9d9cb0 to get more accurate results

@@               Coverage Diff                @@
##           tb/sanitize    #1120       +/-   ##
================================================
+ Coverage        66.64%   78.42%   +11.78%     
================================================
  Files              118      119        +1     
  Lines             7668     8110      +442     
================================================
+ Hits              5110     6360     +1250     
+ Misses            2558     1750      -808     
Impacted Files Coverage Δ
lib/cusparse/CUSPARSE.jl 80.95% <ø> (ø)
lib/cusparse/array.jl 63.90% <25.00%> (+51.08%) ⬆️
lib/cusparse/interfaces.jl 79.06% <83.01%> (+79.06%) ⬆️
lib/cusparse/extra.jl 100.00% <100.00%> (ø)
examples/wmma/high-level.jl 11.11% <0.00%> (-88.89%) ⬇️
examples/wmma/low-level.jl 14.28% <0.00%> (-85.72%) ⬇️
examples/vadd.jl 25.00% <0.00%> (-75.00%) ⬇️
examples/peakflops.jl 66.66% <0.00%> (-33.34%) ⬇️
src/indexing.jl 65.47% <0.00%> (-32.74%) ⬇️
src/reverse.jl 69.49% <0.00%> (-30.51%) ⬇️
... and 78 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5ce89c3...d9d9cb0.

@yuehhua yuehhua marked this pull request as ready for review August 29, 2021 07:40
Contributor Author

yuehhua commented Aug 29, 2021

@maleadt Is the CI working?

Member

maleadt commented Sep 6, 2021

Yes? Why wouldn't it be?

compute-sanitizer seems to find an issue with some conversion:

cusparse/interfaces                   (2) |        started at 2021-08-29T08:49:39.240

      From worker 2:	========= Invalid __global__ write of size 4 bytes

      From worker 2:	=========     at 0x570 in void convert_CooToCsr_kernel<int=1>(int const *,int,int,int*)

      From worker 2:	=========     by thread (3,0,0) in block (0,0,0)

      From worker 2:	=========     Address 0x338f75c18 is out of bounds

      From worker 2:	========= 

      From worker 2:	========= Invalid __global__ write of size 4 bytes

      From worker 2:	=========     at 0x570 in void convert_CooToCsr_kernel<int=1>(int const *,int,int,int*)

      From worker 2:	=========     by thread (4,0,0) in block (0,0,0)

      From worker 2:	=========     Address 0x338f75c1c is out of bounds

      From worker 2:	========= 

Member

maleadt commented Sep 6, 2021

I take it this also fixes #528?

Contributor Author

yuehhua commented Sep 6, 2021

Sure! It fixes #528.

Contributor Author

yuehhua commented Sep 7, 2021

It seems to fail on v1.6 (debug)?

Member

maleadt commented Sep 7, 2021

> It seems to fail on v1.6 (debug)?

Right, so there's something wrong with this PR or some of the kernels it calls. See the error log I posted.

Contributor Author

yuehhua commented Sep 8, 2021

It seems like the error occurs while freeing the buffer?

      From worker 2:	 [14] unsafe_free!(xs::CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer}, stream::CuStream)
      From worker 2:	    @ CUDA /media/yuehhua/Workbench/workspace/CUDA.jl/src/array.jl:80
      From worker 2:	 [15] unsafe_free!(xs::CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer})
      From worker 2:	    @ CUDA /media/yuehhua/Workbench/workspace/CUDA.jl/src/array.jl:72
      From worker 2:	 [16] with_workspace(f::CUDA.CUSPARSE.var"#1053#1055"{Float32, CuSparseMatrixCSR{Float32}, Float32, CuSparseMatrixCSR{Float32}, CuArray{Int32, 1, CUDA.Mem.DeviceBuffer}, CUDA.CUSPARSE.CuMatrixDescriptor, CUDA.CUSPARSE.CuMatrixDescriptor, CUDA.CUSPARSE.CuMatrixDescriptor, Int64, Int64}, eltyp::Type{UInt8}, size::CUDA.CUSPARSE.var"#bufferSize#1054"{Float32, CuSparseMatrixCSR{Float32}, Float32, CuSparseMatrixCSR{Float32}, CuArray{Int32, 1, CUDA.Mem.DeviceBuffer}, CUDA.CUSPARSE.CuMatrixDescriptor, CUDA.CUSPARSE.CuMatrixDescriptor, CUDA.CUSPARSE.CuMatrixDescriptor, Int64, Int64}, fallback::Nothing)
      From worker 2:	    @ CUDA.APIUtils /media/yuehhua/Workbench/workspace/CUDA.jl/lib/utils/call.jl:79
      From worker 2:	 [17] with_workspace (repeats 2 times)
      From worker 2:	    @ /media/yuehhua/Workbench/workspace/CUDA.jl/lib/utils/call.jl:53 [inlined]
      From worker 2:	 [18] geam(alpha::Float32, A::CuSparseMatrixCSR{Float32}, beta::Float32, B::CuSparseMatrixCSR{Float32}, index::Char)
      From worker 2:	    @ CUDA.CUSPARSE /media/yuehhua/Workbench/workspace/CUDA.jl/lib/cusparse/extra.jl:35

Member

maleadt commented Sep 8, 2021

No, that's unrelated. The error only surfaces there because it's the first synchronous call. The real issue is the one spotted by compute-sanitizer, i.e. the error I posted above.

Contributor Author

yuehhua commented Sep 8, 2021

I still cannot find where convert_CooToCsr_kernel is called.

Member

maleadt commented Sep 9, 2021

It's the name of a kernel, probably an internal one in CUSPARSE. We don't currently have host backtraces (disabled due to a compute-sanitizer bug), so we can't report the Julia function that called the kernel.

Anyway, this is why I was looking into upgrading the CUDA toolkit used for these debug runs in #950, since maybe the issue is with CUDA (i.e. not us wrongly calling that kernel).

Contributor Author

yuehhua commented Sep 9, 2021

COO + CSR or CSR + COO triggers the bug, so I have temporarily disabled those tests.

Comment on lines +139 to +142
function Base.:(+)(A::CuSparseMatrixCSR, B::CuSparseMatrix)
csrB = CuSparseMatrixCSR(B)
return geam(one(eltype(A)), A, one(eltype(A)), csrB, 'O')
end
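The intended semantics of the dispatch above (promote `B` to CSR, then `geam` with unit coefficients) can be checked against a CPU stand-in using SparseArrays. `geam_ref` and `add_like_pr` are hypothetical names for this sketch, not CUSPARSE API:

```julia
using SparseArrays

# CPU stand-in for the CUSPARSE geam routine: C = alpha*A + beta*B.
geam_ref(alpha, A, beta, B) = alpha .* A .+ beta .* B

# Mirror of the PR's dispatch: convert B to the same sparse storage
# as A, then add with unit coefficients.
add_like_pr(A::SparseMatrixCSC, B::AbstractMatrix) =
    geam_ref(one(eltype(A)), A, one(eltype(A)), sparse(B))

A = sparse([1.0 0.0; 0.0 2.0])
B = [0.0 3.0; 0.0 0.0]
add_like_pr(A, B)  # sparse result, elementwise A + B
```

On the GPU the conversion step is what routes a `CuSparseMatrixCOO` argument through the COO-to-CSR kernel discussed above.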
yuehhua (Contributor Author)

Maybe the CuSparseMatrixCSR(B) constructor triggers the bug when B is a CuSparseMatrixCOO.

yuehhua (Contributor Author)

It seems to occur in cusparseXcoo2csr.

Contributor Author

yuehhua commented Sep 10, 2021

It's ready for review.

@maleadt maleadt changed the base branch from master to tb/sanitize September 15, 2021 16:29
Member

maleadt commented Sep 15, 2021

Rebased on top of the PR with CI fixes.

@maleadt maleadt deleted the branch JuliaGPU:tb/sanitize September 17, 2021 09:10
@maleadt maleadt closed this Sep 17, 2021
Labels: cuda libraries (Stuff about CUDA library wrappers.)
Successfully merging this pull request may close: Addition and multiplication over cuarray and cusparse generates inconsistent outcome types or error