This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

Performance issue with v2.1.0 compared with v1.7.3 #701

Closed

Description

Describe the bug
CuArrays v2.1.0 is noticeably slower than v1.7.3 for small models.

To Reproduce
The Minimal Working Example (MWE) for this bug:

(@v1.4) pkg> st
  [587475ba] Flux v0.10.4
  [3a865a2d] CuArrays v2.1.0 #master (https://github.com/JuliaGPU/CuArray
  [be33ccc6] CUDAnative v3.0.4

julia> using Flux, CuArrays, BenchmarkTools

julia> model = Chain(
           Dense(4, 128, relu),
           Dense(128, 128, relu),
           Dense(128, 2),
       ) |> gpu
Chain(Dense(4, 128, relu), Dense(128, 128, relu), Dense(128, 2))

julia> @benchmark  CuArrays.@sync model($(cu(rand(4))))
BenchmarkTools.Trial: 
  memory estimate:  8.80 KiB
  allocs estimate:  276
  --------------
  minimum time:     93.864 μs (0.00% GC)
  median time:      115.179 μs (0.00% GC)
  mean time:        125.542 μs (1.97% GC)
  maximum time:     50.622 ms (48.86% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> CuArrays.version()
v"10.1.243"

For comparison:

(@v1.4) pkg> st
  [be33ccc6] CUDAnative v2.10.2
  [3a865a2d] CuArrays v1.7.3
  [587475ba] Flux v0.10.3

julia> @benchmark  CuArrays.@sync model($(cu(rand(4))))
BenchmarkTools.Trial: 
  memory estimate:  8.16 KiB
  allocs estimate:  223
  --------------
  minimum time:     45.627 μs (0.00% GC)
  median time:      74.875 μs (0.00% GC)
  mean time:        85.175 μs (2.61% GC)
  maximum time:     32.836 ms (33.09% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> CUDAdrv.version()
v"10.1.0"

Expected behavior
CuArrays v2.1.0 should run this small model about as fast as v1.7.3 does.

Environment details
Details on Julia:

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

Additional context
Tested on an RTX 2080 Ti.

Note that the model above is quite small. For some large models, performance is similar between v2.1.0 and v1.7.3. However, I'm still quite interested in why there is such a significant difference for small models.
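
To put a number on the gap, here is a quick sketch that uses only the timings reported in the two `@benchmark` runs above. The names `v2`, `v1`, and `slowdown` are just illustrative, and the code does not need a GPU since it only divides the reported figures.

```julia
# Timings in microseconds, copied from the two @benchmark outputs above.
v2 = (minimum = 93.864, median = 115.179, mean = 125.542)  # CuArrays v2.1.0
v1 = (minimum = 45.627, median = 74.875, mean = 85.175)    # CuArrays v1.7.3

# Element-wise ratio; a value above 1 means v2.1.0 is slower by that factor.
slowdown = map(/, v2, v1)

for (name, ratio) in pairs(slowdown)
    println(rpad(string(name), 8), round(ratio, digits = 2), "x")
end
```

With these numbers the ratio comes out to roughly 2.06x for the minimum time, 1.54x for the median, and 1.47x for the mean, so the regression is most visible in the best-case latency.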
