Skip to content

feat: reinit="create_new" setting for multiple simultaneous runs in one process #9562

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: graphite-base/9562
Choose a base branch
from

Conversation

timoffex
Copy link
Contributor

@timoffex timoffex commented Mar 5, 2025

Adds a create_new option to the reinit setting which causes wandb.init() to create a new run even if other runs exist. This should eventually become the default, and the other options ("return_previous" and "finish_previous") should be deprecated.

Although it's possible to pass reinit as a wandb.init() argument, the best way to use this is with wandb.setup().

Here is an example that uses the create_new setting to (serially) run multiple sub-experiments while logging their results to another run:

wandb.setup(wandb.Settings(reinit="create_new"))

with wandb.init() as experiment_results_run:
    for ...:
        with wandb.init() as run:
            # The do_experiment() function logs fine-grained metrics
            # to the given run and then returns result metrics that
            # we'd like to track separately.
            experiment_results = do_experiment(run)

            # After each experiment, we log its results to a parent
            # run. Each point in the parent run's charts corresponds
            # to one experiment's results.
            experiment_results_run.log(experiment_results)

This PR also removes util._is_offline() and replaces it with a heuristic that uses the newly added most_recent_active_run. This way, _is_offline continues to work as expected in most cases.

Copy link
Contributor Author

timoffex commented Mar 5, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@timoffex timoffex changed the title reinit allow feat: reinit="allow" setting for multiple simultaneous runs in one process Mar 5, 2025
Copy link

codecov bot commented Mar 5, 2025

Codecov Report

Attention: Patch coverage is 93.51852% with 7 lines in your changes missing coverage. Please review.

Project coverage is 79.71%. Comparing base (a4587d3) to head (867ad2f).

Files with missing lines Patch % Lines
wandb/beta/workflows.py 20.00% 4 Missing ⚠️
wandb/jupyter.py 33.33% 2 Missing ⚠️
wandb/sdk/data_types/base_types/wb_value.py 92.30% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@                      Coverage Diff                      @@
##           04-10-wandb_magic_cleanup    #9562      +/-   ##
=============================================================
- Coverage                      80.40%   79.71%   -0.70%     
=============================================================
  Files                            781      782       +1     
  Lines                          81118    81172      +54     
=============================================================
- Hits                           65225    64707     -518     
- Misses                         15164    15738     +574     
+ Partials                         729      727       -2     
Flag Coverage Δ
func 46.11% <52.94%> (+0.01%) ⬆️
system 64.78% <93.20%> (-0.04%) ⬇️
unit 63.81% <28.76%> (-1.86%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
tests/conftest.py 98.35% <100.00%> (-0.41%) ⬇️
tests/system_tests/test_core/test_wandb_init.py 100.00% <ø> (ø)
...s/system_tests/test_core/test_wandb_init_reinit.py 100.00% <100.00%> (ø)
tests/system_tests/test_core/test_wandb_setup.py 100.00% <100.00%> (ø)
wandb/sdk/artifacts/artifact.py 88.28% <100.00%> (+0.03%) ⬆️
...db/sdk/artifacts/storage_handlers/azure_handler.py 67.05% <ø> (ø)
wandb/sdk/data_types/table.py 89.90% <100.00%> (ø)
wandb/sdk/data_types/trace_tree.py 80.89% <ø> (ø)
wandb/sdk/internal/profiler.py 75.00% <ø> (ø)
wandb/sdk/wandb_init.py 84.49% <100.00%> (+0.16%) ⬆️
... and 7 more

... and 33 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@timoffex timoffex force-pushed the 03-05-reinit_allow branch from d7c7f89 to d6495b6 Compare March 5, 2025 23:44
@timoffex timoffex force-pushed the 03-04-reinit_string branch from 57349c1 to 219caae Compare March 5, 2025 23:44
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from d6495b6 to a214c23 Compare March 6, 2025 00:33
@timoffex timoffex marked this pull request as ready for review March 6, 2025 00:34
@timoffex timoffex requested a review from a team as a code owner March 6, 2025 00:34
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from a214c23 to dfef31c Compare March 6, 2025 00:39
@timoffex timoffex changed the base branch from 03-04-reinit_string to graphite-base/9562 March 6, 2025 01:25
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from dfef31c to 97e1220 Compare March 6, 2025 01:25
@timoffex timoffex force-pushed the graphite-base/9562 branch from 219caae to 1459bda Compare March 6, 2025 01:25
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from 97e1220 to 34d6234 Compare March 6, 2025 01:25
@timoffex timoffex changed the base branch from graphite-base/9562 to main March 6, 2025 01:25
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from 34d6234 to ef35260 Compare March 6, 2025 01:25
@timoffex timoffex requested a review from kptkin March 10, 2025 19:09
Comment on lines 281 to 396
Note that if another run already exists, `wandb.run` will continue to
refer to it and will not be updated to the new run. When that run
finishes, `wandb.run` will become `None`. This may affect some
integrations.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this behavior of global run is going to cause some confusion with users, as it will require to carefully track when they create and finish each run, maybe a better solution is to disallow global run when we have multiple active runs?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or have multiple global runs? and they need to use id or something, not sure if this approach is better, but an alternative to fully disabling global run

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior is for integrations; end users should avoid using wandb.run.

An alternative I considered is to make accesses to wandb.run raise an error after reinit="allow" is used. But we check wandb.run often in our own code (for example, to get its settings), so without fixing all those instances, it would break the SDK.

So wandb.run needs to either be None or a run. If starting a reinit="allow" run were to change it to None or the new run, then integrations relying on wandb.run that were initialized before the run would break or misbehave.

The rule in this PR is that wandb.run can only change when you call wandb.init() and refers to the same run until that run finishes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm i see... i guess i was thinking of integrations in third party repo, where i don't think they change these settings, but i guess there is a case for integrations in our code base.

i guess it's fine, we probably should document it and discourage certain usage patterns especially with integrations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't mention using wandb.run in our guide for integrations, and I recently updated those docs and some other top-level docs to not use the global run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated reinit="create_new" to not touch wandb.run as discussed!

@timoffex timoffex changed the base branch from main to graphite-base/9562 March 25, 2025 00:03
@timoffex timoffex changed the base branch from graphite-base/9562 to main March 25, 2025 00:04
@timoffex timoffex changed the base branch from main to graphite-base/9562 March 25, 2025 16:56
@timoffex timoffex changed the base branch from graphite-base/9562 to main March 25, 2025 16:57
@timoffex timoffex changed the base branch from main to graphite-base/9562 March 25, 2025 16:58
@timoffex timoffex changed the base branch from graphite-base/9562 to main March 25, 2025 16:58
@timoffex timoffex changed the base branch from main to graphite-base/9562 March 25, 2025 16:59
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from ef35260 to 24e7034 Compare March 25, 2025 16:59
@timoffex timoffex changed the base branch from graphite-base/9562 to 03-24-multi_run_gpu_metrics March 25, 2025 16:59
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from 24e7034 to d88c9af Compare March 25, 2025 17:11
@timoffex timoffex force-pushed the 03-24-multi_run_gpu_metrics branch from d5de854 to 100ae87 Compare March 25, 2025 17:11
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from d88c9af to 785422b Compare March 25, 2025 17:40
@timoffex timoffex requested a review from kptkin March 25, 2025 20:41
@timoffex timoffex changed the base branch from 03-24-multi_run_gpu_metrics to graphite-base/9562 March 25, 2025 20:41
timoffex added a commit that referenced this pull request Mar 25, 2025
Factors out a `monitor.GPUResourceManager` struct that manages the `gpu_stats` process and whose lifetime is tied to the server rather than a run.

When using `reinit="allow"` (added in #9562) only a single `gpu_stats` process is started up when multiple runs are active. The process is killed when no runs are active.
@timoffex timoffex force-pushed the graphite-base/9562 branch from 100ae87 to 526f4dc Compare March 25, 2025 20:56
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from 785422b to ca7aeea Compare March 25, 2025 20:56
@graphite-app graphite-app bot changed the base branch from graphite-base/9562 to main March 25, 2025 20:57
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from ca7aeea to 0da4fca Compare March 25, 2025 20:57
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from 0da4fca to 723fc52 Compare April 8, 2025 22:10
@timoffex timoffex changed the title feat: reinit="allow" setting for multiple simultaneous runs in one process feat: reinit="create_new" setting for multiple simultaneous runs in one process Apr 8, 2025
@timoffex timoffex changed the base branch from main to graphite-base/9562 April 11, 2025 18:21
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from 723fc52 to 83d8071 Compare April 11, 2025 18:22
@timoffex timoffex changed the base branch from graphite-base/9562 to 04-10-wandb_magic_cleanup April 11, 2025 18:22
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from 83d8071 to 296a658 Compare April 11, 2025 18:31
@timoffex timoffex force-pushed the 04-10-wandb_magic_cleanup branch from b780b89 to a4587d3 Compare April 11, 2025 18:31
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from 296a658 to 148c3f6 Compare April 11, 2025 22:22
@timoffex timoffex force-pushed the 03-05-reinit_allow branch from 148c3f6 to 867ad2f Compare April 11, 2025 22:55
@timoffex timoffex requested a review from a team as a code owner April 11, 2025 22:55
@timoffex timoffex changed the base branch from 04-10-wandb_magic_cleanup to graphite-base/9562 April 16, 2025 00:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants