Create global mlflow run and use it for checkpoints #144

Open
wants to merge 7 commits into
base: single-controller-hackathon

@irenedea (Collaborator) commented Aug 8, 2025

  • Creates a global MLflow run that Composer will reuse. (We can also use this global run to log other metrics outside of the training run 😄)
  • Saves checkpoints to MLflow, so that non-interactive runs can autoresume.
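
A minimal sketch of the "one global run, reused everywhere" idea, not the code in this PR. It assumes the run ID is shared through the `MLFLOW_RUN_ID` environment variable, which MLflow's `start_run()` consults when no `run_id` is passed. `get_or_create_global_run_id` and the `create_run` callback are hypothetical names; in a real setup `create_run` would be something like `lambda: mlflow.start_run().info.run_id`.

```python
import os
import uuid

# Environment variable used to share the run ID across components.
RUN_ID_ENV = "MLFLOW_RUN_ID"


def get_or_create_global_run_id(create_run=lambda: uuid.uuid4().hex) -> str:
    """Return the shared run ID, creating it only on the first call.

    `create_run` is a hypothetical hook; the default just makes up an ID
    so the sketch runs without an MLflow server.
    """
    run_id = os.environ.get(RUN_ID_ENV)
    if run_id is None:
        run_id = create_run()
        # Publish the ID so later callers (and child processes) reuse it.
        os.environ[RUN_ID_ENV] = run_id
    return run_id
```

With this shape, anything that logs outside the training loop can attach to the same run by asking for the ID instead of starting a new run.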

As a follow-up to this PR, we should also load the experience buffer from the checkpoint so that checkpointing works correctly with async. (Shouldn't be too hard.) Right now it only works for sync.
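
A rough sketch of what that follow-up could look like, under the assumption that the experience buffer is a picklable sequence. `save_checkpoint`, `load_checkpoint`, and the state keys are hypothetical and not taken from this PR.

```python
import pickle
from collections import deque
from pathlib import Path


def save_checkpoint(path, iteration, experience_buffer):
    """Write the iteration counter and the buffer contents to one file."""
    state = {
        "iteration": iteration,
        "experience_buffer": list(experience_buffer),
    }
    # Write to a temp file first so a crash never leaves a half-written checkpoint.
    tmp = Path(str(path) + ".tmp")
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    tmp.replace(path)


def load_checkpoint(path, maxlen=None):
    """Restore the iteration counter and rebuild the buffer as a deque."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["iteration"], deque(state["experience_buffer"], maxlen=maxlen)
```

Restoring the buffer together with the iteration counter means an async resume would pick up the rollouts that were still queued instead of starting from an empty buffer.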

https://dbc-559ffd80-2bfc.cloud.databricks.com/ml/experiments/723944411900647/runs/fcbceb3f3c9142539744a0883575ab0a/system-metrics?o=7395834863327820

image

You can see the metrics and system metrics for two iterations; the second one was a resumption. This was a very small dummy run, so the loss values don't seem to show up when they repeat at 0.0... 🤷‍♀️

@rithwik-db force-pushed the irene/checkpoints-mlflow branch from 3e87ee0 to b46fc36 on August 10, 2025 01:01
print(f'Autoresuming from checkpoint for RolloutAgent.')
with open(self.latest_checkpoint, 'rb') as f:
get_file(self.latest_checkpoint_path, self.latest_checkpoint_path, overwrite=True)
Collaborator

TODO: We probably need to pass the right source path here instead of using self.latest_checkpoint_path for both the source and the destination?
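
One possible way to address this, as a sketch only: keep the remote checkpoint location and the local download location as separate values, and derive the local one from the remote one. `local_destination_for` and the example paths are hypothetical; the result would then be passed as the second argument to `get_file`, with the remote path as the first.

```python
import os
from urllib.parse import urlparse


def local_destination_for(remote_path: str, download_dir: str) -> str:
    """Map a remote checkpoint path to a local file inside download_dir."""
    # urlparse splits off any scheme/bucket, leaving just the object path.
    filename = os.path.basename(urlparse(remote_path).path)
    if not filename:
        raise ValueError(f"No file name in remote path: {remote_path!r}")
    return os.path.join(download_dir, filename)
```

The call in the diff would then look roughly like `get_file(remote_path, local_destination_for(remote_path, download_dir), overwrite=True)`, where `remote_path` and `download_dir` are whatever the agent already knows about its checkpoint location.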

2 participants