Open
Labels: performance (Work related to performance improvements)
Description
What happened?
The initial investigation shows that training on the A100 (JWB) is faster than on the GH200 (Santis). Here is how to reproduce the result:
```shell
git checkout d24c4b6800b45bd1f859e61d8b29eab5a540c176
../WeatherGenerator-private/hpc/launch-slurm.py --time 180 --nodes=1
```
Here are the results:

| Run ID | HPC system | Branch (config) | Ingested samples per GPU |
|---|---|---|---|
| pyizojg7 | Santis | develop (1 node, 180 min) | 6684 |
| cc0xrzbm | JWB | develop (1 node, 180 min) | 7688 |
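For reference, the gap can be quantified directly from the numbers above. This is a quick back-of-the-envelope sketch (not part of the repro, and the variable names are illustrative only):

```python
# Ingested samples per GPU over the same 180-minute window,
# taken from the table above.
santis_gh200 = 6684  # run pyizojg7, develop, 1 node
jwb_a100 = 7688      # run cc0xrzbm, develop, 1 node

ratio = jwb_a100 / santis_gh200
print(f"JWB/Santis throughput ratio: {ratio:.3f} (~{(ratio - 1) * 100:.0f}% faster)")
# prints: JWB/Santis throughput ratio: 1.150 (~15% faster)
```

So the A100 nodes on JWB ingested roughly 15% more samples per GPU than the GH200 nodes on Santis under an otherwise identical configuration.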
What are the steps to reproduce the bug?
No response
Hedgedoc link to logs and more information. This ticket is public, do not attach files directly.
No response