Refactor: checkpoint, stopping logic #178

Open
RobotSail opened this issue Aug 26, 2024 · 0 comments
Comments

@RobotSail (Member)

We currently have the library set up to perform actions such as saving checkpoints and finishing training based on how long training has been running. But the criteria for these actions can differ, for example:

  • training for N epochs
  • training up to K samples seen
  • saving every N epochs OR saving every K samples
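
As a sketch of the direction this refactor could take (hypothetical names, not the library's actual API), the criteria above could be modeled as small strategy objects that the training loop queries, rather than logic scattered through the loop itself:

```python
from dataclasses import dataclass


@dataclass
class TrainingProgress:
    """Snapshot of how far training has advanced."""
    epochs_completed: int
    samples_seen: int


class StoppingCriterion:
    """Decides whether training should finish, given current progress."""

    def should_stop(self, progress: TrainingProgress) -> bool:
        raise NotImplementedError


@dataclass
class MaxEpochs(StoppingCriterion):
    """Stop after training for N epochs."""
    n_epochs: int

    def should_stop(self, progress: TrainingProgress) -> bool:
        return progress.epochs_completed >= self.n_epochs


@dataclass
class MaxSamples(StoppingCriterion):
    """Stop once K samples have been seen."""
    k_samples: int

    def should_stop(self, progress: TrainingProgress) -> bool:
        return progress.samples_seen >= self.k_samples
```

The training loop would then only call `criterion.should_stop(progress)` at each boundary, and new criteria could be added without touching the loop.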

There are also different types of checkpoints to save. For full fine-tuning we may export a Hugging Face-formatted checkpoint for users to consume, but we may also want to periodically checkpoint the DeepSpeed engine state so training can be resumed later.

For the reasons listed above, we should refactor how these criteria are expressed so that they are simpler to communicate and the library's behavior becomes more predictable.
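
One possible shape for the save side, again as a hedged sketch with invented names (`CheckpointKind`, `SaveRule` are illustrations, not existing classes): each rule pairs a checkpoint type with its trigger, and the "every N epochs OR every K samples" semantics live in one place:

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class CheckpointKind(Enum):
    HF_EXPORT = auto()        # Hugging Face-formatted weights for users to consume
    DEEPSPEED_STATE = auto()  # full DeepSpeed engine state for resuming training


@dataclass
class SaveRule:
    """Fires a save when EITHER the epoch trigger OR the sample trigger is due."""
    kind: CheckpointKind
    every_n_epochs: Optional[int] = None
    every_k_samples: Optional[int] = None
    _last_sample_save: int = field(default=0, repr=False)

    def should_save(self, epochs_completed: int, samples_seen: int) -> bool:
        # Epoch trigger: checked at epoch boundaries, skipping epoch 0.
        if (
            self.every_n_epochs
            and epochs_completed > 0
            and epochs_completed % self.every_n_epochs == 0
        ):
            return True
        # Sample trigger: fires once K new samples have been seen since the last save.
        if (
            self.every_k_samples
            and samples_seen - self._last_sample_save >= self.every_k_samples
        ):
            self._last_sample_save = samples_seen
            return True
        return False
```

The trainer would hold a list of such rules and, when one fires, dispatch on `rule.kind` to either the exporter or the DeepSpeed state saver.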
