Data Transfer from CPU to GPU is not optimized #1399

@javak87

Description

What happened?

The initial investigation shows that training on the A100 (JWB) is faster than on the GH200 (Santis). Here is how to reproduce the result:

```shell
git checkout d24c4b6800b45bd1f859e61d8b29eab5a540c176
../WeatherGenerator-private/hpc/launch-slurm.py --time 180 --nodes=1
```

Here is the result:

| run_id | HPC | PR | Ingested samples per GPU |
|---|---|---|---|
| pyizojg7 | Santis | develop (1 node) (180 mins) | 6684 |
| cc0xrzbm | JWB | develop (1 node) (180 mins) | 7688 |
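The gap can be quantified as a relative throughput difference. A minimal sketch using the sample counts from the table above and the 180-minute run window (the per-minute rates and percentage are derived here, not taken from the logs):

```python
# Ingested samples per GPU over the 180-minute runs (from the table above).
RUNTIME_MIN = 180
samples = {"JWB (A100)": 7688, "Santis (GH200)": 6684}

def throughput_per_min(n_samples: int, minutes: int = RUNTIME_MIN) -> float:
    """Samples ingested per GPU per minute of wall-clock training time."""
    return n_samples / minutes

a100 = throughput_per_min(samples["JWB (A100)"])
gh200 = throughput_per_min(samples["Santis (GH200)"])

# Relative slowdown of the GH200 run compared to the A100 run.
slowdown = 1.0 - gh200 / a100
print(f"A100:  {a100:.1f} samples/min/GPU")
print(f"GH200: {gh200:.1f} samples/min/GPU")
print(f"GH200 ingests {slowdown:.1%} fewer samples than A100")
```

This puts the GH200 run roughly 13% behind the A100 run, which is consistent with a host-to-device transfer bottleneck dominating over raw GPU compute.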

What are the steps to reproduce the bug?

No response

Hedgedoc link to logs and more information. This ticket is public, do not attach files directly.

No response

Metadata

Assignees

Labels

performance: Work related to performance improvements
