Merged
22 commits
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -91,6 +91,12 @@ The config keys `services.HealthMonitor.config.check-gcs` and `.gcs-bucket-to-ch
Code relating to the Google Genomics API (aka `v1Alpha`) has been removed since Google has entirely disabled that service.
Cloud Life Sciences (aka `v2Beta`, deprecated) and Google Batch (aka `batch`, recommended) remain the two viable GCP backends.

#### GPU changes
* Removed support for Nvidia K80 "Kepler" GPUs, which were [discontinued by GCP in May 2024](https://cloud.google.com/compute/docs/eol/k80-eol).
* The default GPU on Life Sciences is now the Nvidia P100.
* The default GPU on GCP Batch is now the Nvidia T4.
* Updated runtime attributes documentation to clarify that the `nvidiaDriverVersion` key is ignored on GCP Batch.
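* Workflows that relied on the old implicit K80 default can avoid surprises by pinning a GPU type explicitly; a minimal sketch of the relevant runtime keys:

```
runtime {
  gpuType: "nvidia-tesla-t4"  # the new GCP Batch default, stated explicitly
  gpuCount: 1
}
```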

## 87 Release Notes

### GCP Batch
@@ -1,14 +1,19 @@
name: gpu_cuda_image
testFormat: workflowsuccess
backends: [Papi, GCPBATCH]
ignore: true

files {
workflow: gpu_on_papi/gpu_cuda_image.wdl
}

# As of November 2024, GCP Batch was using driver 550 and Life Sciences 535.
# Neither matches the 418 version this test used to pin.
#
# On Life Sciences, the `nvidiaDriverVersion` attribute appears to be ignored outright by the API.
#
# On Batch, it is not wired through Cromwell, and we may leave it that way unless a reason to support it emerges.

metadata {
status: Succeeded
"outputs.gpu_cuda_image.modprobe_check.0": "good"
"outputs.gpu_cuda_image.smi_check.0": "good"
"outputs.gpu_cuda_image.smi_check": "gpu_good\nvram_good"
}
@@ -2,48 +2,32 @@ version 1.0

workflow gpu_cuda_image {

input {
Array[String] driver_versions = [ "418.87.00" ]
}

scatter (driver_version in driver_versions) {
call get_machine_info { input: driver_version = driver_version }
}
call get_machine_info

output {
Array[String] modprobe_check = get_machine_info.modprobe_check
Array[String] smi_check = get_machine_info.smi_check

Array[File] modprobe_contents = get_machine_info.modprobe_content
Array[File] smi_contents = get_machine_info.smi_content
String smi_check = get_machine_info.smi_check
File smi_contents = get_machine_info.smi_content
}
}

task get_machine_info {
input {
String driver_version
}

command <<<
nvidia-modprobe --version > modprobe
cat modprobe | grep -q "~{driver_version}" && echo "good" > modprobe_check || echo "bad" > modprobe_check
nvidia-smi > smi
cat smi | grep -q "~{driver_version}" && echo "good" > smi_check || echo "bad" > smi_check
cat smi | grep -q "Tesla T4" && echo "gpu_good" > smi_check || echo "bad" > smi_check
cat smi | grep -q "15360MiB" && echo "vram_good" >> smi_check || echo "bad" >> smi_check
>>>

runtime {
docker: "nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04"
docker: "nvidia/cuda:12.6.2-cudnn-devel-ubuntu24.04"
bootDiskSizeGb: 20
gpuType: "nvidia-tesla-k80"
gpuType: "nvidia-tesla-t4"
gpuCount: 1
nvidiaDriverVersion: driver_version
zones: "us-central1-c"
}

output {
String modprobe_check = read_string("modprobe_check")
String smi_check = read_string("smi_check")
File modprobe_content = "modprobe"
File smi_content = "smi"
}
}
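The grep-based checks in the task above can be sketched in isolation against a canned line of `nvidia-smi` output (the sample line below is a hypothetical stand-in; the real test reads the VM's actual GPU):

```shell
# Canned stand-in for one line of `nvidia-smi` output (hypothetical sample).
printf '|   0  Tesla T4           Off  | 00000000:00:04.0 Off | 15360MiB |\n' > smi

# Same check logic as the WDL task: verify the GPU model and the VRAM size.
grep -q "Tesla T4" smi && echo "gpu_good" > smi_check || echo "bad" > smi_check
grep -q "15360MiB" smi && echo "vram_good" >> smi_check || echo "bad" >> smi_check

cat smi_check
```

`grep -q` exits 0 on a match, so each line writes its "good" marker only when the expected string is present, which is why the test's expected output is the two-line `gpu_good\nvram_good`.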
7 changes: 4 additions & 3 deletions docs/RuntimeAttributes.md
@@ -395,19 +395,20 @@ Make sure to choose a zone for which the type of GPU you want to attach is avail

The types of compute GPU supported are:

* `nvidia-tesla-k80`
* `nvidia-tesla-v100`
* `nvidia-tesla-p100`
* `nvidia-tesla-p4`
* `nvidia-tesla-t4`

For the latest list of supported GPUs, please visit [Google's GPU documentation](nvidia-drivers-us-public).

The default driver is `418.87.00`, you may specify your own via the `nvidiaDriverVersion` key. Make sure that driver exists in the `nvidia-drivers-us-public` beforehand, per the [Google Pipelines API documentation](https://cloud.google.com/genomics/reference/rest/Shared.Types/Metadata#VirtualMachine).
On the Life Sciences API, the default driver is `418.87.00`. You may specify your own via the `nvidiaDriverVersion` key. Make sure the driver exists in the `nvidia-drivers-us-public` bucket beforehand, per the [Google Pipelines API documentation](https://cloud.google.com/genomics/reference/rest/Shared.Types/Metadata#VirtualMachine).

On GCP Batch, `nvidiaDriverVersion` is currently ignored; Batch selects the correct driver version automatically.

```
runtime {
gpuType: "nvidia-tesla-k80"
gpuType: "nvidia-tesla-t4"
gpuCount: 2
nvidiaDriverVersion: "418.87.00"
zones: ["us-central1-c"]
@@ -90,7 +90,15 @@
): InstancePolicy.Builder = {

// set GPU count to 0 if not included in workflow
val gpuAccelerators = accelerators.getOrElse(Accelerator.newBuilder.setCount(0).setType("")) // TODO: Driver version
// `setDriverVersion()` is available but we're using the Batch default for now
//
// Nvidia lifecycle reference:
// https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-drivers
//
// GCP docs:
// https://cloud.google.com/batch/docs/create-run-job-gpus#install-gpu-drivers
// https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs#Accelerator.FIELDS.driver_version
val gpuAccelerators = accelerators.getOrElse(Accelerator.newBuilder.setCount(0).setType(""))

Codecov warning: added line supportedBackends/google/batch/src/main/scala/cromwell/backend/google/batch/api/GcpBatchRequestFactoryImpl.scala#L101 was not covered by tests.

val instancePolicy = InstancePolicy.newBuilder
.setProvisioningModel(spotModel)
@@ -18,17 +18,14 @@

object GpuResource {

val DefaultNvidiaDriverVersion = "418.87.00"

final case class GpuType(name: String) {
override def toString: String = name
}

object GpuType {
val NVIDIATeslaP100 = GpuType("nvidia-tesla-p100")
val NVIDIATeslaK80 = GpuType("nvidia-tesla-k80")
val NVIDIATeslaT4 = GpuType("nvidia-tesla-t4")

val DefaultGpuType: GpuType = NVIDIATeslaK80
val DefaultGpuType: GpuType = NVIDIATeslaT4
Comment from @aednichols (Collaborator, Author), Nov 21, 2024:

I'm open to opinions on this default. T4 is the most modern GPU of the lot and the one I see used most often in practice. I also think few people rely on the default GPU anyway, which is a good thing, because Google deleted the old default!

Reply from a Collaborator:

T4 makes sense. It would either be that or the V100, and the T4 is cheaper.
val DefaultGpuCount: Int Refined Positive = refineMV[Positive](1)
val MoreDetailsURL = "https://cloud.google.com/compute/docs/gpus/"
}
@@ -95,10 +92,6 @@
private def gpuTypeValidation(runtimeConfig: Option[Config]): OptionalRuntimeAttributesValidation[GpuType] =
GpuTypeValidation.optional

val GpuDriverVersionKey = "nvidiaDriverVersion"
private def gpuDriverValidation(runtimeConfig: Option[Config]): OptionalRuntimeAttributesValidation[String] =
new StringRuntimeAttributesValidation(GpuDriverVersionKey).optional

private def gpuCountValidation(
runtimeConfig: Option[Config]
): OptionalRuntimeAttributesValidation[Int Refined Positive] = GpuValidation.optional
@@ -149,7 +142,6 @@
.withValidation(
gpuCountValidation(runtimeConfig),
gpuTypeValidation(runtimeConfig),
gpuDriverValidation(runtimeConfig),
cpuValidation(runtimeConfig),
cpuPlatformValidation(runtimeConfig),
disksValidation(runtimeConfig),
@@ -180,10 +172,8 @@
.extractOption(gpuTypeValidation(runtimeAttrsConfig).key, validatedRuntimeAttributes)
lazy val gpuCount: Option[Int Refined Positive] = RuntimeAttributesValidation
.extractOption(gpuCountValidation(runtimeAttrsConfig).key, validatedRuntimeAttributes)
lazy val gpuDriver: Option[String] =
RuntimeAttributesValidation.extractOption(gpuDriverValidation(runtimeAttrsConfig).key, validatedRuntimeAttributes)

val gpuResource: Option[GpuResource] = if (gpuType.isDefined || gpuCount.isDefined || gpuDriver.isDefined) {
val gpuResource: Option[GpuResource] = if (gpuType.isDefined || gpuCount.isDefined) {

Codecov warning: added line supportedBackends/google/batch/src/main/scala/cromwell/backend/google/batch/models/GcpBatchRuntimeAttributes.scala#L176 was not covered by tests.
Option(
GpuResource(gpuType.getOrElse(GpuType.DefaultGpuType),
gpuCount
@@ -9,12 +9,11 @@ import wom.values.{WomFloat, WomInteger, WomSingleFile, WomString, WomValue}
class GcpBatchGpuAttributesSpec extends AnyWordSpecLike with Matchers with GcpBatchRuntimeAttributesSpecsMixin {

val validGpuTypes = List(
(Option(WomString("nvidia-tesla-k80")), Option(GpuType.NVIDIATeslaK80)),
(Option(WomString("nvidia-tesla-p100")), Option(GpuType.NVIDIATeslaP100)),
(Option(WomString("nvidia-tesla-t4")), Option(GpuType.NVIDIATeslaT4)),
(Option(WomString("custom-gpu-24601")), Option(GpuType("custom-gpu-24601"))),
(None, None)
)
val invalidGpuTypes = List(WomSingleFile("nvidia-tesla-k80"), WomInteger(100))
val invalidGpuTypes = List(WomSingleFile("nvidia-tesla-t4"), WomInteger(100))

val validGpuCounts = List(
(Option(WomInteger(1)), Option(1)),
@@ -26,9 +26,8 @@ object GpuResource {
}
object GpuType {
val NVIDIATeslaP100 = GpuType("nvidia-tesla-p100")
val NVIDIATeslaK80 = GpuType("nvidia-tesla-k80")

val DefaultGpuType: GpuType = NVIDIATeslaK80
val DefaultGpuType: GpuType = NVIDIATeslaP100
val DefaultGpuCount: Int Refined Positive = refineMV[Positive](1)
val MoreDetailsURL = "https://cloud.google.com/compute/docs/gpus/"
}
@@ -9,12 +9,11 @@ import wom.values.{WomFloat, WomInteger, WomSingleFile, WomString, WomValue}
class PipelinesApiGpuAttributesSpec extends AnyWordSpecLike with Matchers with PipelinesApiRuntimeAttributesSpecsMixin {

val validGpuTypes = List(
(Option(WomString("nvidia-tesla-k80")), Option(GpuType.NVIDIATeslaK80)),
(Option(WomString("nvidia-tesla-p100")), Option(GpuType.NVIDIATeslaP100)),
(Option(WomString("custom-gpu-24601")), Option(GpuType("custom-gpu-24601"))),
(None, None)
)
val invalidGpuTypes = List(WomSingleFile("nvidia-tesla-k80"), WomInteger(100))
val invalidGpuTypes = List(WomSingleFile("nvidia-tesla-t4"), WomInteger(100))

val validGpuCounts = List(
(Option(WomInteger(1)), Option(1)),