Collaborator

@tjhunter tjhunter commented Mar 12, 2025

Closes #29

First part: prototyping the new format.


# TODO: performance: we reopen the file on every call. This is better for multiprocessing,
# but we can probably do better, for example by relying on the logging module.
with open(os.path.join(self.path_run, "metrics.json"), "ab") as f:
Collaborator Author

I suggest that we start with this simple version; we can always improve performance if it turns out to be a bottleneck.
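
For illustration, a minimal sketch of the open-per-call approach discussed here. It assumes one JSON object per line, which the quoted diff does not confirm; `append_metrics` and `read_metrics` are hypothetical names, not functions from this PR.

```python
import json
import os


def append_metrics(path_run: str, record: dict) -> None:
    """Append one record as a JSON line, reopening the file on every call."""
    line = json.dumps(record).encode("utf-8") + b"\n"
    # "ab" creates the file if it is missing and always writes at the end,
    # so no file handle has to be kept open between calls.
    with open(os.path.join(path_run, "metrics.json"), "ab") as f:
        f.write(line)


def read_metrics(path_run: str) -> list[dict]:
    """Read back all records, skipping blank lines (assumed line-per-record layout)."""
    with open(os.path.join(path_run, "metrics.json"), "rb") as f:
        return [json.loads(raw) for raw in f if raw.strip()]
```

Reopening the file costs a system call per record, but it keeps the writer stateless, which is the trade-off the TODO above describes.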

@tjhunter tjhunter marked this pull request as ready for review March 13, 2025 16:18
]

[tool.uv.sources]
flash-attn = { url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl" }
Collaborator Author

I had to make this change to use uv on the hpc2020 cluster. I am not sure whether this will be a breaking change for people. @clessig, do we assume that different HPCs can use different versions of CUDA? That sounds like a nightmare.

Collaborator

We do not assume it, we know it ;) One can write a script that detects the available CUDA version (and the Python version, if that also varies) and then assembles the string that defines the wheel to download. @tjhunter: To what extent could one integrate this into pyproject.toml?
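
A rough sketch of what such a script could look like. The file-name pattern is inferred only from the single wheel URL in the diff above, so it is an assumption that other CUDA, torch, or Python combinations follow it or exist as release assets. `flash_attn_wheel_url` is a hypothetical helper; detecting the CUDA version is left to the caller (for example from `torch.version.cuda` when torch is already installed).

```python
import sys


def flash_attn_wheel_url(
    cuda_major: int,
    torch_version: str,
    python_version: tuple = tuple(sys.version_info[:2]),
    flash_attn_version: str = "2.7.4.post1",
    cxx11_abi: bool = False,
) -> str:
    """Assemble a wheel URL following the pattern seen in the diff (assumed, not verified)."""
    py_tag = f"cp{python_version[0]}{python_version[1]}"
    abi = "TRUE" if cxx11_abi else "FALSE"
    filename = (
        f"flash_attn-{flash_attn_version}"
        f"+cu{cuda_major}torch{torch_version}cxx11abi{abi}"
        f"-{py_tag}-{py_tag}-linux_x86_64.whl"
    )
    return (
        "https://github.com/Dao-AILab/flash-attention/releases/download/"
        f"v{flash_attn_version}/{filename}"
    )


# Reproduces the URL pinned in pyproject.toml for CUDA 12, torch 2.6, Python 3.12.
url = flash_attn_wheel_url(12, "2.6", (3, 12))
```

The resulting string would still have to be written into `[tool.uv.sources]` by some other step, since the table in the diff holds a fixed URL.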

Collaborator

And could we open an issue to track this? :)

Collaborator Author

I have the script in a branch of the private repo, but it is not committed yet:
#57

@tjhunter tjhunter merged commit 1dece82 into develop Mar 17, 2025
3 checks passed
@tjhunter
Collaborator Author

As discussed, this will be followed up by #90.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Refactor TrainLogger