Poor scaling to multiple GPUs with electrostatic solver #5036
Thanks for reporting this. The fact that the ES solver does not scale as well as the EM solver is indeed expected: the ES solver requires more MPI communication than the EM solver, and your observations are in line with what other WarpX users have seen when scaling the ES solver to multiple GPUs. Nevertheless, it might still be possible to find ways to improve the scaling. It could also help if you can post the profiler output. I also know that @pmessmer is interested in speeding up the ES solver in WarpX; maybe he'd have some suggestions.
Btw, @archermarx: when attempting to run the Python script that you posted (but with a single process), the run fails at the first iteration. Is that your case too? Or am I missing something (e.g. are you compiling a modified/older version of WarpX, or are you using non-default compiler flags)?
Hi Remi, no; running on one proc, this runs to completion on my end. My compiler options are listed below. The only non-default thing I'm using (I think) is single-precision particles. I'm running on WarpX v24.07.

```bash
# Build WarpX (the only non-default option is single-precision particles)
cmake -S . -B build \
    -DWarpX_LIB=ON \
    -DWarpX_APP=OFF \
    -DWarpX_MPI=ON \
    -DWarpX_COMPUTE=CUDA \
    -DWarpX_DIMS="1;2;3" \
    -DWarpX_PYTHON=ON \
    -DWarpX_PRECISION=DOUBLE \
    -DWarpX_PARTICLE_PRECISION=SINGLE
cmake --build build --target pip_install -j 20
```
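As a quick sanity check (not part of the original report), one way to confirm that the `pip_install` target produced a usable module is to import the PICMI interface in the same Python environment:

```python
# Hypothetical post-build check: the import should succeed without error
# in the environment where `cmake --build build --target pip_install` ran.
from pywarpx import picmi

print("pywarpx PICMI interface imported successfully")
```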
EDIT: issue resolved.
After resolving some issues, I have more realistic scaling results. Not nearly as bad as before, but still suboptimal.

First, I show the speedup over 1 GPU for different workloads on 1, 2, 4, and 8 GPUs:

[Figure: speedup vs. number of GPUs for each workload]

Next, I show how the speedup grows as a function of workload:

[Figure: speedup vs. workload]

TinyProf insights

I've attached TinyProf output for 1 GPU and for 8 GPUs below. Here are some of the main insights:

[...]

This is a huge fraction. Any idea how to speed this up?

tinyprof_1gpu.txt
EDIT: see more recent results (with profiling) here
Hi all,
I'm trying to scale up an electrostatic simulation to multiple GPUs and am getting very poor results. To diagnose things, I ran uniform plasma simulations using the attached PICMI file, with the only difference between runs being the workload and the solver choice.
The results are shown in the figure below. Here, each line represents a different workload, and the y-axis is the time per step for that simulation on 1 GPU divided by the time per step with the number of GPUs shown on the x-axis. All simulations are electromagnetic, except for the line labelled "ES".
As I increase the workload when using the electromagnetic solver, the simulation scaling approaches the ideal scaling, which is great.
However, the same is not true for the electrostatic (ES) solver, which does not see a speedup at all even at the largest problem sizes.
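For reference, the speedup metric is sketched below with placeholder timings (the numbers are illustrative, not measured data):

```python
# Speedup over 1 GPU: (time per step on 1 GPU) / (time per step on N GPUs).
# The timings here are placeholders for illustration, not measurements.
time_per_step = {1: 1.00, 2: 0.55, 4: 0.30, 8: 0.18}  # seconds per step

for n_gpus, t in sorted(time_per_step.items()):
    speedup = time_per_step[1] / t
    print(f"{n_gpus} GPU(s): {speedup:.2f}x speedup (ideal: {n_gpus}x)")
```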
Some more details
The base workload ("Work = 1") is 2 particles per cell in each dimension, with 32 x 32 x 32 cells. Work = 4096 corresponds to 8 ppc in each dimension with 128 x 128 x 128 cells, which was the largest simulation I could fit in memory on a single GPU.
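As a sanity check on that factor (assuming the workload is taken to scale with the total number of macroparticles, i.e. cells times particles per cell):

```python
# Work factor relative to the base case, assuming work ~ total macroparticle
# count = (number of cells) x (particles per cell).
base = 32**3 * 2**3     # 32^3 cells, 2 ppc per dimension
big  = 128**3 * 8**3    # 128^3 cells, 8 ppc per dimension

print(big / base)       # 4096.0 = 64 (more cells) x 64 (more particles per cell)
```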
I'm running WarpX on GPU with single precision particles. The nodes I'm using have 8x NVIDIA H100 GPUs, so all of these computations are on a single node. I also tested adding a second node (16 GPUs total), but the results were equally poor.
The PICMI input file is here:
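The attached file isn't reproduced in this thread; for orientation, a minimal PICMI sketch of the kind of uniform-plasma setup described above might look like the following (names, domain size, density, and time step are illustrative assumptions, not the values from the actual input file):

```python
# Hypothetical minimal uniform-plasma setup via PICMI (illustrative values only;
# not the original input file attached to this issue).
from pywarpx import picmi

nx = ny = nz = 32  # base workload: 32 x 32 x 32 cells

grid = picmi.Cartesian3DGrid(
    number_of_cells=[nx, ny, nz],
    lower_bound=[0.0, 0.0, 0.0],
    upper_bound=[0.1, 0.1, 0.1],
    lower_boundary_conditions=['periodic'] * 3,
    upper_boundary_conditions=['periodic'] * 3,
    lower_boundary_conditions_particles=['periodic'] * 3,
    upper_boundary_conditions_particles=['periodic'] * 3,
)

# Swap for picmi.ElectromagneticSolver(grid=grid) to reproduce the EM runs.
solver = picmi.ElectrostaticSolver(grid=grid)

electrons = picmi.Species(
    particle_type='electron',
    name='electrons',
    initial_distribution=picmi.UniformDistribution(density=1e14),
)

sim = picmi.Simulation(solver=solver, time_step_size=1e-10, max_steps=100)
sim.add_species(
    electrons,
    layout=picmi.GriddedLayout(grid=grid, n_macroparticle_per_cell=[2, 2, 2]),
)
sim.step()
```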
My job script is here. I just change `ntasks-per-node` to set the number of GPUs.