self play takes more and more time #41
First of all, thanks for filing this issue. This is reminiscent of an issue I've had from the very start of the project, although the effects here look much more dramatic. On my computer too, performance tends to decrease after each iteration, in a way that is fixed by restarting Julia periodically. This decrease in performance is caused by an increasing amount of time spent in GC (if confirmed, this would explain why you see a decrease in CPU utilization over time, as GC collections block all threads as far as I know). This effect used to be dramatic (see JuliaGPU/CUDA.jl#137) and I remember AlphaZero.jl spending 90% of its time in the GC after one or two iterations. Recently, things have been looking much better on my computer (~20% performance loss over the whole connect-four training) but the effect is still there.

To confirm what is happening in your case, could you share the performance plots that are automatically generated after each iteration? Also, could you run the experiment using different CUDA memory pools by setting JULIA_CUDA_MEMORY_POOL to either "cuda", "split" or "binned" before launching the program?

Admittedly, I am still at a loss regarding the source of these performance regressions. A natural explanation would be a memory leak in AlphaZero.jl, but I don't see how this could happen as it shares very little state across training iterations. Hopefully, your reports will get us closer to the truth, as the effects you are observing are so dramatic.
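Concretely, the pool has to be selected before CUDA.jl initializes the device, so something like the snippet below at the very top of the training script (or an exported environment variable) should do. The Scripts.train call is the entry point from the tutorial and is only shown for illustration:

# Select the CUDA memory pool before CUDA.jl is initialized, i.e. before
# AlphaZero/CUDA are loaded. Setting it in the shell works too:
#   export JULIA_CUDA_MEMORY_POOL=split
ENV["JULIA_CUDA_MEMORY_POOL"] = "split"   # also try "cuda" or "binned"

using AlphaZero
Scripts.train("connect-four")   # tutorial entry point, shown for illustration only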
Thanks for your reply. I also suspect there is a CUDA memory leak; I'll do more experiments and post the results. P.S. CUDA.jl v3 seems to have much improvement, can't wait for it!
GC costs 60%~90% of the time in self-play when using the default JULIA_CUDA_MEMORY_POOL, and only about 15% when setting ENV["JULIA_CUDA_MEMORY_POOL"] = "split". Maybe we should make "split" the default? Besides, there are two ways to start training connect-four in https://jonathan-laurent.github.io/AlphaZero.jl/dev/tutorial/connect_four/, but there are still […]
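For anyone who wants to check the GC share on their side, a rough way to measure it is to wrap a call with Base.@timed; the workload() function below is just a placeholder for one self-play iteration:

# Rough sketch: Base.@timed reports the time a call spends in garbage
# collection, so wrapping one self-play iteration (workload() here is only a
# placeholder) gives the GC share.
workload() = sum(abs2, rand(Float32, 10^7))
stats = @timed workload()
println("GC share: ", round(100 * stats.gctime / stats.time; digits=1), "%")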
I updated the Manifest on #master so that it uses CUDA 3.0.
I should just remove the […]
Do I need to set JULIA_CUDA_MEMORY_POOL now? Besides, I notice that you just added a development branch of Flux in the Manifest.
These are the iter 1 and iter 2 perfs when setting the memory pool to "split". An error happens during self-play in iter 3:
Starting iteration 3
======self play starting
Starting self-play
CUDNNError: CUDNN_STATUS_EXECUTION_FAILED (code 8)
Stacktrace:
[1] throw_api_error(res::CUDA.CUDNN.cudnnStatus_t)
@ CUDA.CUDNN ~/.julia/packages/CUDA/Px7QU/lib/cudnn/error.jl:22
[2] macro expansion
@ ~/.julia/packages/CUDA/Px7QU/lib/cudnn/error.jl:39 [inlined]
[3] cudnnActivationForward(handle::Ptr{Nothing}, activationDesc::CUDA.CUDNN.cudnnActivationDescriptor, alpha::Base.RefValue{Float32}, xDesc::CUDA.CUDNN.cudnnTensorDescriptor, x::CUDA.CuArray{Float32, 4}, beta::Base.RefValue{Float32}, yDesc::CUDA.CUDNN.cudnnTensorDescriptor, y::CUDA.CuArray{Float32, 4})
@ CUDA.CUDNN ~/.julia/packages/CUDA/Px7QU/lib/utils/call.jl:26
[4] #cudnnActivationForwardAD#645
@ ~/.julia/packages/CUDA/Px7QU/lib/cudnn/activation.jl:48 [inlined]
[5] #cudnnActivationForwardWithDefaults#644
@ ~/.julia/packages/CUDA/Px7QU/lib/cudnn/activation.jl:42 [inlined]
[6] #cudnnActivationForward!#641
@ ~/.julia/packages/CUDA/Px7QU/lib/cudnn/activation.jl:22 [inlined]
[7] #46
@ ~/.julia/packages/NNlibCUDA/ESR3l/src/cudnn/activations.jl:13 [inlined]
[8] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(NNlib.relu), Tuple{CUDA.CuArray{Float32, 4}}})
@ NNlibCUDA ~/.julia/packages/NNlibCUDA/ESR3l/src/cudnn/activations.jl:30
[9] (::Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}})(x::CUDA.CuArray{Float32, 4}, cache::Nothing)
@ Flux.CUDAint ~/.julia/packages/Flux/Lffio/src/cuda/cudnn.jl:9
[10] BatchNorm
@ ~/.julia/packages/Flux/Lffio/src/cuda/cudnn.jl:6 [inlined]
[11] applychain(fs::Tuple{Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}}, x::CUDA.CuArray{Float32, 4}) (repeats 2 times)
@ Flux ~/.julia/packages/Flux/Lffio/src/layers/basic.jl:36
[12] (::Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}, Flux.Chain{Tuple{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}}}, typeof(+)}, AlphaZero.FluxLib.var"#15#16"}}}})(x::CUDA.CuArray{Float32, 4})
@ Flux ~/.julia/packages/Flux/Lffio/src/layers/basic.jl:38
[13] forward(nn::ResNet, state::CUDA.CuArray{Float32, 4})
@ AlphaZero.FluxLib ~/code/jls/AlphaZero.jl/src/networks/flux.jl:142
[14] forward_normalized(nn::ResNet, state::CUDA.CuArray{Float32, 4}, actions_mask::CUDA.CuArray{Float32, 2})
@ AlphaZero.Network ~/code/jls/AlphaZero.jl/src/networks/network.jl:260
[15] evaluate_batch(nn::ResNet, batch::Vector{NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}})
@ AlphaZero.Network ~/code/jls/AlphaZero.jl/src/networks/network.jl:308
[16] fill_and_evaluate(net::ResNet, batch::Vector{NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}; batch_size::Int64, fill::Bool)
@ AlphaZero ~/code/jls/AlphaZero.jl/src/simulations.jl:32
[17] (::AlphaZero.var"#34#35"{Bool, ResNet, Int64})(batch::Vector{NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}})
@ AlphaZero ~/code/jls/AlphaZero.jl/src/simulations.jl:49
[18] macro expansion
@ ~/code/jls/AlphaZero.jl/src/batchifier.jl:62 [inlined]
[19] macro expansion
@ ~/code/jls/AlphaZero.jl/src/util.jl:57 [inlined]
[20] (::AlphaZero.Batchifier.var"#1#3"{AlphaZero.var"#34#35"{Bool, ResNet, Int64}, Int64, Channel{Any}})()
@ AlphaZero.Batchifier ./threadingconstructs.jl:169
If you look at the Manifest, I am now using the dg/cuda16 development branch of Flux, which should be merged pretty soon in a new patch release (see FluxML/Flux.jl#1571).
With CUDA 3.0, you should set it to "cuda" or not set it at all ("cuda" is the default memory pool, provided that your CUDA toolkit has version 11.2 or higher). I agree that the comment in […]
These kinds of errors often mean an OOM in disguise... Given your config, this probably indicates a memory leak somewhere... It would be interesting to know if you can get the same error using the "cuda" memory pool. Thanks for your help in figuring this out. This is very helpful!
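In the meantime, it may also help to keep an eye on GPU memory between iterations, e.g. from the Julia session (watching nvidia-smi in another terminal works just as well):

# Print used/free GPU memory and pool statistics, then force a full GC pass
# and return cached pool memory to the driver before checking again.
using CUDA
CUDA.memory_status()
GC.gc(true); CUDA.reclaim()
CUDA.memory_status()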
Thanks. I'll test the "cuda" memory pool.
It seems that we found the reason. My computer broke yesterday and it took me some time to recover it. I'll update the CUDA version and test it again.
Please note that CUDA.jl usually installs its own version of the CUDA toolkit. Therefore, if your CUDA driver is compatible with 11.2, you might not even need to update your system's CUDA installation. You can use CUDA.versioninfo() to check.
Yeah, I remember. Here is what I get:
julia> CUDA.versioninfo()
CUDA toolkit 11.0.3, artifact installation
CUDA driver 11.0.0
NVIDIA driver 450.102.4
Libraries:
- CUBLAS: 11.2.0
- CURAND: 10.2.1
- CUFFT: 10.2.1
- CUSOLVER: 10.6.0
- CUSPARSE: 11.1.1
- CUPTI: 13.0.0
- NVML: 11.0.0+450.102.4
- CUDNN: 8.10.0 (for CUDA 11.2.0)
- CUTENSOR: 1.2.2 (for CUDA 11.1.0)
Toolchain:
- Julia: 1.6.0
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
1 device:
0: GeForce RTX 2080 Ti (sm_75, 10.367 GiB / 10.759 GiB available)
I'll do it manually.
I think it's because my NVIDIA driver is too old. When I updated the NVIDIA driver to […]. Interestingly, after upgrading the driver, I installed the CUDA toolkit from the NVIDIA website, which is 11.3. When using AlphaZero, CUDA.jl still downloads the artifact CUDA toolkit 11.2, no matter whether I set […]
I found another problem: running on a supposedly more powerful machine is actually slower. Julia version, NVIDIA driver / CUDA toolkit version, AlphaZero.jl code: I try to keep everything the same. It's so strange. Yeah, this may be another issue; I'll do more digging.
On CUDA 3.0, "cuda" is the default memory pool and it will therefore be used whether you set ENV["JULIA_CUDA_MEMORY_POOL"] = "cuda" or not.
This is because 11.3 is recent and the CUDA.jl maintainers haven't tested it properly yet. I imagine this will be fixed in the next release though.
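If you really want CUDA.jl to pick up your local 11.3 installation instead of the artifact, I believe the relevant switch is the one below, but I haven't tested it here, so please double-check against the CUDA.jl documentation:

# Untested suggestion: as far as I know, this flag tells CUDA.jl 3.x to use a
# locally installed toolkit instead of downloading an artifact. It must be set
# before `using CUDA` (or exported in the shell).
ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"
using CUDA
CUDA.versioninfo()   # should then report the local toolkit version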
This is surprising indeed, and something I have also observed in the past (running AlphaZero.jl on supposedly more performant machines results in worse performance). One possible reason (which I am currently investigating) is that Julia currently does not allow tasks to migrate between threads, so random circumstances that influence which task gets assigned to which thread by the scheduler may result in unbalanced CPU loads.
I suspect there are maybe too many threads (one thread per worker, 128 threads vs. my 14-core CPU), leading to a lot of context switching. I think the best outcome is to make either the CPU or the GPU nearly 100% utilized. I'm trying to use a thread pool to see if we can get there. Besides, maybe there is still a CUDA memory leak: it just crashes after iter 3, throwing a CUDA error. I posted an issue here: JuliaGPU/CUDA.jl#866
Last time I checked, 128 workers (not 128 threads: Julia will spawn 128 tasks and spread them over as many threads as you have CPU cores available) were faster on my computer than 64. The goal is not so much to parallelize simulation as to send big batches to the neural network. One reason the GPU utilization is not currently higher (beyond time spent in GC) is that the inference server currently stops the world, as it only runs once all of the 128 workers are stuck on an inference request. Therefore, every time it is done running, it must stay idle while it waits for all the workers to send data again. I am going to see if I can optimize this. Ultimately, I agree with you that we should shoot for either ~100% GPU utilization or ~100% CPU utilization, and your criterion is a good one. Also, it is very good that you are investigating possible memory leaks in CUDA; I will be looking into this as well.
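To make the "stop the world" point concrete, here is a deliberately simplified sketch of the scheme (this is not the actual src/batchifier.jl code): the server only fires once every worker is blocked on a pending request, so the GPU is idle while the workers simulate and the workers are idle while the GPU runs.

# Deliberately simplified sketch, not the actual src/batchifier.jl code:
# the server evaluates a batch only once ALL workers are blocked on a pending
# request, so CPU simulation and GPU inference never overlap in time.
struct Request
    state::Vector{Float32}           # encoded game state
    reply::Channel{Vector{Float32}}  # where the evaluation result is sent back
end

function inference_server(evaluate, requests::Channel{Request}, num_workers::Int)
    while true
        batch = [take!(requests) for _ in 1:num_workers]    # wait for every worker
        outputs = evaluate([r.state for r in batch])        # one big GPU call
        foreach((r, y) -> put!(r.reply, y), batch, outputs) # unblock the workers
    end
end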
I just pushed a change that enables a major speedup on my computer (I went from ~40% GPU utilization during self-play to ~70%). Indeed, I am allowing the number of simulation agents to be larger than the batch size of the inference server so that the CPUs can keep simulating games while the GPU is running. You may want to try it out. Also, if your computer has a lot of RAM (>=32G) and a powerful GPU, you may want to increase the number of simulation workers and the batch size:
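For example (a sketch only: the constructor names and the num_games value are assumptions on my side, so adapt this to whatever games/connect-four/params.jl contains in your checkout; the two fields that matter are num_workers and batch_size):

# Sketch only; constructor names and num_games are assumptions, adapt to your
# params file. Keeping num_workers at a multiple of batch_size means the GPU
# always has a full batch available while the remaining workers keep simulating.
self_play = SelfPlayParams(
    sim = SimParams(
        num_games   = 5000,  # games per iteration (assumed value, adjust)
        num_workers = 256,   # simulation agents (current default: 128)
        batch_size  = 128,   # inference batch sent to the GPU (current default: 64)
        use_gpu     = true),
    mcts = mcts_params)      # keep your existing MCTS parameters here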
Wow, you made a lot of commits. Thanks very much. I'll test it and post the results!
I tested the default params:
params.self_play.sim.batch_size = 64
params.self_play.sim.num_workers = 128
It gives about a 40% speedup. I'll do more experiments and try to figure it out. When I'm ready, I'll post another issue or make a PR.
Hi Jonathan!
I've been trying to tune AlphaZero.jl hyperparameters recently and I ran into some problems. With master (commit 91bb698) and nothing changed, I find that self-play takes more and more time:
iter 1: 49m, GPU 33%, CPU 300%
iter 2: 2h02m, GPU 15%, CPU 330%
iter 3: 7h30m, GPU 4%, CPU 230%
Memory has 54G free. This is so strange.
Below is my system info:
CPU: Intel(R) Core(TM) i9-10940X @ 3.30GHz, 14 physical cores / 28 threads
Memory: 64G
GPU: RTX 2080 Ti (NVIDIA-SMI 450.102.04, Driver Version 450.102.04, CUDA Version 11.0)
OS: Ubuntu 18.04
I think it would be fine if either the CPU or the GPU were fully utilized, but no matter how I change the parameters, I just can't make it happen. Even worse, iter 2 uses less GPU than iter 1, and iter 3 even less.