
pin cuda-nvcc as a temporary workaround #389

Merged: 2 commits into pangeo-data:master on Oct 13, 2022
Conversation

@ngam (Contributor) commented on Oct 12, 2022

fix #387 (potentially, pending more testing)

@github-actions (Contributor)

Binder 👈 Try on Mybinder.org!
Binder 👈 Try on Pangeo GCP Binder!
Binder 👈 Try on Pangeo AWS Binder!

@pangeo-bot (Collaborator)

/condalock
Automatically locking new conda environment, building, and testing images...

@ngam (Contributor, Author) commented on Oct 12, 2022

Alternatively, one could use this fix:

import os
# Set this before the first JAX/XLA computation so the flag is picked up when the GPU backend initializes.
os.environ["XLA_FLAGS"] = "--xla_gpu_force_compilation_parallelism=1"
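
For context, a minimal smoke test along these lines (a sketch, assuming a CUDA-enabled jax/jaxlib build is installed) exercises the kind of jit compilation that was killing the kernel in #387; with the flag set it should simply print a result:

import os
os.environ["XLA_FLAGS"] = "--xla_gpu_force_compilation_parallelism=1"  # before any JAX work

import jax
import jax.numpy as jnp

print(jax.devices())  # should list the GPU device on a GPU node

# A tiny jit-compiled reduction; compiling a graph like this is what previously crashed the kernel on K80s.
x = jnp.arange(16.0)
print(jax.jit(lambda a: (a ** 2).sum())(x))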

Closing!

@ngam closed this on Oct 12, 2022
@ngam reopened this on Oct 13, 2022
@pangeo-bot (Collaborator)

/condalock
Automatically locking new conda environment, building, and testing images...

@ngam reopened this on Oct 13, 2022
@pangeo-bot (Collaborator)

/condalock
Automatically locking new conda environment, building, and testing images...

@scottyhq merged commit 118d497 into pangeo-data:master on Oct 13, 2022
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this pull request Oct 14, 2022
Brings in pangeo-data/pangeo-docker-images#389

Based on pangeo-data/pangeo-docker-images#390,
start making T4 the default. Folks can still use K80 if they want.

This makes it easier to use CUDA based GPU accelerated code.

Follow-up to 2i2c-org#1766
@yuvipanda (Member)

@ngam @dhruvbalwada is this only needed to support K80s? We could just say we no longer support K80s and switch entirely to T4s. In 2i2c-org/infrastructure#1772 I make T4s the default.

@dhruvbalwada (Member)

This PR provides support for T4s only. The problem with K80s was not resolved and requires turning multi-threading off, so maybe we should just switch over to T4s entirely.

@yuvipanda (Member)

@dhruvbalwada yeah, 2i2c-org/infrastructure#1772 makes T4s the default; K80s are still available there. We only have quota for 4 T4s; should we get more?

@dhruvbalwada (Member) commented on Oct 14, 2022

I don't know the answer to that one yet. I personally probably won't be using more than that (probably only 1 GPU) for the next few months. If we need more, we can increase the quota at a later point.

Also, based on looking at the logs, do you see any other users reaching the limit often? If so, maybe we can ask them personally if they would like more?

Edit: I actually didn't even realize that more than one GPU can be used. How do we access them?

@ngam (Contributor, Author) commented on Oct 14, 2022

@yuvipanda, I am obviously not associated with LEAP, M2LINES, etc., so I have no idea what's actually going on behind the scenes. I also don't use the public Pangeo hubs. However, I am happy to help if needed.

Following the conversations in multiple threads, I think an interesting proposal for this org to consider is optimizing the software for specific hardware. That is essentially the problem here: if one could confirm and pin the hardware specs (CUDA driver, highest supported AVX level, etc.), one could either compile pieces of the software targeting exactly that, or at least work with conda-forge to ensure proper support is handled correctly.
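
As a rough illustration of that idea (not something in this PR), one could first record what the image and node actually provide before pinning against it; a minimal sketch, assuming a Linux node with nvidia-smi and nvcc on PATH:

import subprocess

def run(cmd):
    # Capture a command's stdout; these tools ship in the CUDA-enabled images.
    return subprocess.run(cmd, capture_output=True, text=True).stdout

print(run(["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv"]))  # GPU model and driver
print(run(["nvcc", "--version"]))                      # the pinned cuda-nvcc / ptxas version
print(run(["grep", "-m1", "flags", "/proc/cpuinfo"]))  # CPU feature flags (AVX level, etc.)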

To reiterate though, this is not a "problem" in the grand scheme of things; it is just a minor issue in compute-graph compilation (parallel or not). So an even easier, behind-the-scenes solution is to set an environment variable in the images that sidesteps the unnecessary error entirely: XLA_FLAGS=--xla_gpu_force_compilation_parallelism=1.

@dhruvbalwada in your work or in your examples, try running with and without the parallelism env var and see how things turn out. I suspect there will be no difference at all in terms of speed, but that's just a guess.
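
A rough way to do that comparison (a sketch; the benchmark script name is hypothetical, and XLA_FLAGS is read when the backend initializes, so each setting gets its own process):

import os
import subprocess
import time

BENCHMARK = "train_step_benchmark.py"  # hypothetical script exercising the jit-compiled model

for flags in ("", "--xla_gpu_force_compilation_parallelism=1"):
    env = dict(os.environ, XLA_FLAGS=flags)  # copy the environment, overriding only XLA_FLAGS
    start = time.perf_counter()
    subprocess.run(["python", BENCHMARK], env=env, check=True)
    print(f"XLA_FLAGS={flags!r}: {time.perf_counter() - start:.1f}s total")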

@rabernat (Member)

> Edit: I actually didn't even realize that more than one GPU can be used. How do we access them?

These quotas are for the whole project. It means only 4 people can be using T4s at once on the entire M2LInES hub. We don't support multiple GPUs per user, but could potentially try that if people thought it would be useful.

In terms of general usage, we have a bit of a chicken-and-egg problem here. We have not done much work to explain to users how to take advantage of these resources. We don't have demos, examples, training, etc. So it is not surprising that usage is low.

In the short term, it's fine to drop K80s and use only T4s. I doubt that we would have more than 4 simultaneous GPU users in the project.

@yuvipanda (Member)

We could definitely try multiple GPUs on the same pod if folks think that would be useful. I've made T4s the default now, but I think the suggestion is that we drop K80s completely. https://cloud.google.com/compute/docs/gpus#nvidia_gpus_for_compute_workloads is the list of available GPUs; let me know if we should make any others available. It is fairly trivial to do so!

@ngam (Contributor, Author) commented on Oct 14, 2022

> We don't support multiple GPUs per user, but could potentially try that if people thought it would be useful.

I think it is quite difficult to justify using multiple GPUs simultaneously (cost, compute, etc.), e.g. with NVLink or something similar. Unless someone showcases a very specific example where it is beneficial, I wouldn't do it. Obviously, someone could make the argument for running completely independent jobs in parallel (e.g. two different training runs at the same time), in which case my point doesn't apply.

> ... let me know if we should make any other ones available.

Obviously A100s are super efficient and powerful, but again, a user must justify the expense. It will likely be cost effective only if a user can show that they are actually able to use an A100 effectively (that is the purpose of my optimizations with XLA, etc.: to make use of everything the A100 offers). In other words, running the same workload on an A100 can be cheaper than on a T4 if all optimizations are fully utilized. Without naming names, I know someone among your target users who was using many A100s on a cluster I have access to, and utilization was around 20%, and only for a fraction of the allocation time. That is obviously a total waste; since those A100s sit on a shared cluster, it just means longer waiting times for others (and wasted electricity).

It takes effort and time to make code efficient, so I would put the onus on the user to prove the cost efficiency before resources are committed. In scientific computing we are pretty bad at using resources efficiently, in my experience, so just my 2c on this topic for now. Save your money by making T4s available; if someone can show good use of an A100, then make it available for them: it's a total game changer.
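
For what it's worth, a quick way to sanity-check utilization while a job is running (a sketch, assuming nvidia-smi is available on the node) is something like:

import subprocess

# Print GPU utilization and memory every 10 seconds; runs until interrupted (Ctrl-C).
subprocess.run([
    "nvidia-smi",
    "--query-gpu=timestamp,utilization.gpu,memory.used,memory.total",
    "--format=csv",
    "--loop=10",
])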

Development

Successfully merging this pull request may close this issue: Using jax in ml-notebook instantly kills kernel
6 participants