
Performance benchmarks on GPU #249

Open
andigu opened this issue Aug 26, 2024 · 2 comments

Comments


andigu commented Aug 26, 2024

The paper presents an interesting approach with promising results; however, the performance section would benefit from greater specificity regarding the hardware used for the experiments. For instance, it is unclear whether the results were obtained using a particular GPU model, and details such as memory capacity and other relevant hardware specifications are not provided. This information is crucial for reproducibility and for understanding the context of the reported performance.

Additionally, it would be helpful to elaborate on the memory limitations encountered, particularly why the memory saturates at 100 samples. Given that modern GPUs, such as the NVIDIA A100, are equipped with up to 80GB of memory, it seems plausible that they could handle significantly more than 100 images of size 128x128. Discussing this discrepancy could provide valuable insights into the potential limitations of the method or implementation choices that may have impacted memory usage.

@ConnorStoneAstro
Member

This is a great point. We mentioned in the figure caption that we used a V100 GPU, but some discussion of the hardware capabilities would be helpful.

Indeed, 100 images at 128x128 is not very hard to store, even on a V100. However, we did the sampling at 4x upsampling, which multiplies the memory load by a factor of 16. At 64-bit floating point, this means each operation uses almost 2GB of memory. Between PyTorch overhead and intermediate calculations (e.g., storing FFT-transformed images for convolution), we end up using nearly all of the V100's memory. With an A100 we could easily go past this threshold; however, we chose not to use A100s because our goal was to use the performance tests as a quick discussion of how GPUs work: their performance graphs stay flat until they hit saturation, which is a bit different from a lot of people's experience with performant code and changes how one should use them.
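As a back-of-envelope check (the copy counts here are illustrative assumptions, not taken from the paper's code), the memory scaling can be sketched like this:

```python
# Back-of-envelope memory estimate for the upsampled image batch.
# All buffer counts below are assumptions for illustration, not the
# actual allocation pattern of the benchmarked code.
n_images = 100
side = 128           # native image resolution
upsample = 4         # 4x upsampling per axis -> 16x more pixels
bytes_per_float64 = 8

pixels = n_images * (side * upsample) ** 2
raw_gib = pixels * bytes_per_float64 / 1024**3
print(f"raw upsampled batch: {raw_gib:.2f} GiB")

# FFT-based convolution typically holds several complex-valued copies
# at once (complex128 = 16 bytes/value): assume the image FFT, kernel
# FFT, their product, and the inverse transform live simultaneously.
fft_copies = 4
working_gib = fft_copies * pixels * 16 / 1024**3
print(f"approx. FFT working set: {working_gib:.2f} GiB")
```

With these assumed buffer counts, a single batch of 100 upsampled 512x512 images is only ~0.2 GiB raw, but the complex FFT working set approaches 2 GiB per operation; add autograd intermediates and framework overhead, and a 16 GiB V100 saturates quickly.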

I'll work on an updated version of the text and let you know when it is ready!

@ConnorStoneAstro
Member

Hi @andigu, we have added some discussion of the GPU performance and how the memory scaling works. This is great material to add, since we hoped this section could be a starting point for new users learning to use GPUs efficiently for scientific computing. We now explain how the runtime remains constant until the GPU saturates, after which it enters a linear regime. We have re-generated the paper on the JOSS review page; please let me know if you think any further discussion is needed!
