
pin cuda-nvcc as a temporary workaround #389

Merged: 2 commits into pangeo-data:master on Oct 13, 2022
Conversation

@ngam (Contributor) commented on Oct 12, 2022

fix #387 (potentially, pending more testing)

@github-actions (Contributor)

Binder 👈 Try on Mybinder.org!
Binder 👈 Try on Pangeo GCP Binder!
Binder 👈 Try on Pangeo AWS Binder!

@pangeo-bot (Collaborator)

/condalock
Automatically locking new conda environment, building, and testing images...

@ngam (Contributor, Author) commented on Oct 12, 2022

Alternatively, one could use this fix:

import os
# Set this before the first JAX/XLA computation so the flag is picked up when the GPU backend initializes.
os.environ["XLA_FLAGS"] = "--xla_gpu_force_compilation_parallelism=1"
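
For context, a minimal smoke test along these lines (a sketch, assuming a CUDA-enabled jax/jaxlib build is installed) exercises the kind of jit compilation that was killing the kernel in #387; with the flag set it should simply print a result:

import os
os.environ["XLA_FLAGS"] = "--xla_gpu_force_compilation_parallelism=1"  # before any JAX work

import jax
import jax.numpy as jnp

print(jax.devices())  # should list the GPU device on a GPU node

# A tiny jit-compiled reduction; compiling a graph like this is what previously crashed the kernel on K80s.
x = jnp.arange(16.0)
print(jax.jit(lambda a: (a ** 2).sum())(x))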

Closing!

@ngam closed this on Oct 12, 2022
@ngam reopened this on Oct 13, 2022
@pangeo-bot (Collaborator)

/condalock
Automatically locking new conda environment, building, and testing images...

@ngam reopened this on Oct 13, 2022
@pangeo-bot (Collaborator)

/condalock
Automatically locking new conda environment, building, and testing images...

@scottyhq merged commit 118d497 into pangeo-data:master on Oct 13, 2022
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this pull request Oct 14, 2022
Brings in pangeo-data/pangeo-docker-images#389

Based on pangeo-data/pangeo-docker-images#390,
start making T4 the default. Folks can still use K80 if they want.

This makes it easier to use CUDA based GPU accelerated code.

Follow-up to 2i2c-org#1766
@yuvipanda (Member)

@ngam @dhruvbalwada is this only needed to support K80s? We could just say we no longer support K80s and switch entirely to T4s. In 2i2c-org/infrastructure#1772 I make T4s the default.

@dhruvbalwada (Member)

This PR provides support for T4s only. The problem with K80s was not resolved and requires turning multi-threading off, so maybe we should just switch over to T4s entirely.

@yuvipanda (Member)

@dhruvbalwada yeah, 2i2c-org/infrastructure#1772 makes T4s the default; K80s are still available there. We only have quota for 4 T4s; should we get more?

@dhruvbalwada (Member) commented on Oct 14, 2022

I don't know the answer to that one yet. I personally probably won't be using more than that (probably only 1 GPU) for the next few months. If we need more, we can increase the quota at a later point.

Also, based on looking at the logs, do you see any other users reaching the limit often? If so, maybe we can ask them personally if they would like more?

Edit: I actually didn't even realize that more than one GPU can be used. How do we access them?

@ngam (Contributor, Author) commented on Oct 14, 2022

@yuvipanda, I am obviously not associated with LEAP, M2LINES, etc., so I have no idea what's actually going on behind the scenes. I also don't use the public Pangeo hubs. However, I am happy to help if needed.

Following the conversations in multiple threads, I think an interesting proposal for this org to consider is optimizing the software for specific hardware. That is essentially the problem here: if one could confirm and pin the hardware specs (CUDA driver, highest supported AVX level, etc.), one could either compile pieces of the software targeting exactly that, or at least work with conda-forge to ensure proper support is handled correctly.
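
As a rough illustration of that idea (not something in this PR), one could first record what the image and node actually provide before pinning against it; a minimal sketch, assuming a Linux node with nvidia-smi and nvcc on PATH:

import subprocess

def run(cmd):
    # Capture a command's stdout; these tools ship in the CUDA-enabled images.
    return subprocess.run(cmd, capture_output=True, text=True).stdout

print(run(["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv"]))  # GPU model and driver
print(run(["nvcc", "--version"]))                      # the pinned cuda-nvcc / ptxas version
print(run(["grep", "-m1", "flags", "/proc/cpuinfo"]))  # CPU feature flags (AVX level, etc.)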

To reiterate though, this is not a "problem" in the grand scheme of things; it is just a minor issue in compute-graph compilation (parallel or not). So an even easier, behind-the-scenes solution is to set an environment variable in the images that sidesteps the unnecessary error entirely: XLA_FLAGS=--xla_gpu_force_compilation_parallelism=1.

@dhruvbalwada in your work or in your examples, try running with and without the parallelism env var and see how things turn out. I suspect there will be no difference at all in terms of speed, but that's just a guess.
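
A rough way to do that comparison (a sketch; the benchmark script name is hypothetical, and XLA_FLAGS is read when the backend initializes, so each setting gets its own process):

import os
import subprocess
import time

BENCHMARK = "train_step_benchmark.py"  # hypothetical script exercising the jit-compiled model

for flags in ("", "--xla_gpu_force_compilation_parallelism=1"):
    env = dict(os.environ, XLA_FLAGS=flags)  # copy the environment, overriding only XLA_FLAGS
    start = time.perf_counter()
    subprocess.run(["python", BENCHMARK], env=env, check=True)
    print(f"XLA_FLAGS={flags!r}: {time.perf_counter() - start:.1f}s total")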

@rabernat (Member)

> Edit: I actually didn't even realize that more than one GPU can be used. How do we access them?

These quotas are for the whole project. It means only 4 people can be using T4s at once on the entire M2LInES hub. We don't support multiple GPUs per user, but could potentially try that if people thought it would be useful.

In terms of general usage, we have a bit of a chicken-and-egg problem here. We have not done much work to explain to users how to take advantage of these resources. We don't have demos, examples, training, etc. So it is not surprising that usage is low.

In the short term, it's fine to drop K80s and use only T4s. I doubt that we would have more than 4 simultaneous GPU users in the project.

@yuvipanda (Member)

We could definitely try multiple GPUs on the same pod if folks think that would be useful. I've made T4s the default now, but I think the suggestion is that we drop K80s completely. https://cloud.google.com/compute/docs/gpus#nvidia_gpus_for_compute_workloads is the list of available GPUs; let me know if we should make any others available. It is fairly trivial to do so!

@ngam (Contributor, Author) commented on Oct 14, 2022

> We don't support multiple GPUs per user, but could potentially try that if people thought it would be useful.

I think it is quite difficult to justify using multiple GPUs simultaneously (cost, compute, etc.), e.g. with NVLink or something similar. Unless someone showcases a very specific example where it is beneficial, I wouldn't do it. Obviously, someone could make the argument for running completely independent jobs in parallel (e.g. two different training runs at the same time), in which case my point doesn't apply.

> ... let me know if we should make any other ones available.

Obviously A100s are super efficient and powerful, but again, a user must justify the expense. It will likely be cost effective only if a user can show that they are actually able to use an A100 effectively (that is the purpose of my optimizations with XLA, etc.: to make use of everything the A100 offers). In other words, running the same workload on an A100 can be cheaper than on a T4 if all optimizations are fully utilized. Without naming names, I know someone among your target users who was using many A100s on a cluster I have access to, and utilization was around 20%, and only for a fraction of the allocation time. That is obviously a total waste; since those A100s sit on a shared cluster, it just means longer waiting times for others (and wasted electricity).

It takes effort and time to make code efficient, so I would put the onus on the user to prove the cost efficiency before resources are committed. In scientific computing we are pretty bad at using resources efficiently, in my experience, so just my 2c on this topic for now. Save your money by making T4s available; if someone can show good use of an A100, then make it available for them: it's a total game changer.
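
For what it's worth, a quick way to sanity-check utilization while a job is running (a sketch, assuming nvidia-smi is available on the node) is something like:

import subprocess

# Print GPU utilization and memory every 10 seconds; runs until interrupted (Ctrl-C).
subprocess.run([
    "nvidia-smi",
    "--query-gpu=timestamp,utilization.gpu,memory.used,memory.total",
    "--format=csv",
    "--loop=10",
])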

Development

Successfully merging this pull request may close this issue: Using jax in ml-notebook instantly kills kernel
6 participants