@aednichols aednichols commented Nov 19, 2024

Description

It is silly to implement cost capping support for a GPU that has not been usable since May 2024. I also addressed some ambiguities around GPUs in GCP Batch, since I was already in the relevant part of the code.

  • Remove K80 from code, tests, documentation
    • Replaced default K80 with P100 on Life Sciences
    • Replaced default K80 with T4 on GCP Batch
  • Update GCP Batch library to a version that supports the driver version field
    • Deferred due to crash in tests
  • Document that GCP Batch does support nvidiaDriverVersion field, but Cromwell does not currently pass it
  • Revive and modernize GPU hardware Centaur test that we disabled in 2020
    • Check for GPU model and VRAM amount
    • Test no longer checks for driver version

Similar to DataBiosphere/terra-ui#5167

nvidia-smi output on Life Sciences:

Tue Nov 19 16:25:22 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8              10W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

nvidia-smi output on Batch:

Tue Nov 19 16:19:04 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   53C    P8             10W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
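As an illustration of the "GPU model and VRAM amount" check mentioned in the description, here is a minimal sketch of pulling those two values out of `nvidia-smi` table output like the samples above. This is not the Centaur test from this PR; `GpuInfo` and `NvidiaSmiParser` are hypothetical names, and the regular expressions assume the table layout shown here.

```scala
// Hypothetical helper for illustration only; not part of Cromwell or Centaur.
final case class GpuInfo(model: String, totalMemoryMiB: Int)

object NvidiaSmiParser {
  // Matches the row carrying the GPU index and name,
  // e.g. "|   0  Tesla T4        Off | 00000000:00:04.0 Off | ... |"
  private val NameRow = """^\|\s+\d+\s+(.+?)\s+(?:Off|On)\s+\|.*$""".r

  // Matches the memory cell, e.g. "2MiB / 15360MiB" (used / total)
  private val MemoryCell = """(\d+)MiB\s*/\s*(\d+)MiB""".r

  // Returns the first GPU's model and total VRAM, or None if either is missing.
  def parse(output: String): Option[GpuInfo] = {
    val lines = output.linesIterator.toList
    for {
      model <- lines.collectFirst { case NameRow(name) => name.trim }
      total <- lines
        .flatMap(line => MemoryCell.findFirstMatchIn(line))
        .headOption
        .map(_.group(2).toInt)
    } yield GpuInfo(model, total)
  }
}
```

A test could then compare the parsed result against the expected hardware, for example `NvidiaSmiParser.parse(output) == Some(GpuInfo("Tesla T4", 15360))`, without depending on the driver version line, which differs between the two backends.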

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

@aednichols aednichols requested a review from a team as a code owner November 19, 2024 15:48
val NVIDIATeslaT4 = GpuType("nvidia-tesla-t4")

- val DefaultGpuType: GpuType = NVIDIATeslaK80
+ val DefaultGpuType: GpuType = NVIDIATeslaT4

@aednichols aednichols Nov 21, 2024

I'm open to opinions on this default. T4 is the most modern GPU of the lot and the one I see used most often in practice.

I also think few people rely on the default GPU anyway, which is a good thing – because Google deleted the old default!

Collaborator

T4 makes sense. It would be either that or the V100, and the T4 is cheaper.
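For readers unfamiliar with how such a default is used, the following is a rough sketch of the fallback behavior discussed in this thread. It assumes a simplified `GpuType` case class, and `resolve` is a hypothetical helper; the actual Cromwell code may be structured differently.

```scala
// Simplified stand-in for the GpuType in the diff; the real class may differ.
final case class GpuType(name: String)

object GpuType {
  val NVIDIATeslaT4: GpuType = GpuType("nvidia-tesla-t4")

  // The new default chosen in this PR for GCP Batch.
  val DefaultGpuType: GpuType = NVIDIATeslaT4

  // Hypothetical helper: use the requested type if one was given,
  // otherwise fall back to the default.
  def resolve(requested: Option[String]): GpuType =
    requested
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(name => GpuType(name))
      .getOrElse(DefaultGpuType)
}
```

Under these assumptions, `GpuType.resolve(None)` yields the T4, while `GpuType.resolve(Some("nvidia-tesla-v100"))` keeps the explicitly requested type, so only workflows that omit a GPU type are affected by the changed default.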

@dspeck1 dspeck1 left a comment

LGTM

@aednichols aednichols changed the title AN-144 Remove Nvidia Tesla K80 GPU support AN-296 Remove Nvidia Tesla K80 GPU support Nov 21, 2024
@aednichols aednichols changed the title AN-296 Remove Nvidia Tesla K80 GPU support AN-291 Remove Nvidia Tesla K80 GPU support Nov 21, 2024
@aednichols aednichols enabled auto-merge (squash) November 25, 2024 15:09
@aednichols aednichols merged commit 4d01b91 into develop Nov 25, 2024
37 checks passed
@aednichols aednichols deleted the aen_an_144 branch November 25, 2024 16:16
@sam-schu sam-schu mentioned this pull request Dec 20, 2024
4 tasks