Skip to content

fix: BioEngineDatasets lazily re-resolves data server URL (0.9.3)#71

Merged
nilsmechtel merged 2 commits into
mainfrom
feat/datasets-lazy-server-discovery
May 18, 2026
Merged

fix: BioEngineDatasets lazily re-resolves data server URL (0.9.3)#71
nilsmechtel merged 2 commits into
mainfrom
feat/datasets-lazy-server-discovery

Conversation

@nilsmechtel
Copy link
Copy Markdown
Collaborator

@nilsmechtel nilsmechtel commented May 18, 2026

Summary

Resolves the reported issue where BioEngineDatasets permanently locks in service_url = None if the discovery file ~/.bioengine/datasets/bioengine_current_server is missing at construction time, and never re-checks even after the data server starts and writes the file. Long-lived injected clients in Ray Serve deployments (e.g. chiron-manager) could not see datasets that became available after deploy.

Approach

The issue suggested two options: lazy re-read in list_datasets(), or an explicit refresh() method. I implemented both because they cover different scenarios:

  • Lazy resolution (_resolve_service_url) handles the common case the issue describes: the client was constructed before the data server. No caller change required. Every public method now calls _resolve_service_url() first; the fast path is a single attribute check, the slow path reads one small file.
  • Explicit refresh() handles the second-class case the lazy path can't: the data server has restarted on a new URL. The lazy path doesn't fire while service_url is already set, so it can't detect a server change. refresh() closes the old httpx.AsyncClient and re-runs the resolution.

Changes

  • bioengine/datasets/datasets.py

    • Store the user's request verbatim on self._requested_data_server_url ("auto" / explicit URL / None) instead of collapsing it at init.
    • Add _resolve_service_url(*, initial=False) — lazy attach. Re-read the discovery file when service_url is None and the request was "auto". Initial-call warning only at construction so a never-attached client doesn't spam logs.
    • Add async refresh() — close old client, clear state, re-resolve. Returns the new URL or None.
    • Move the discovery file path to _CURRENT_SERVER_FILE class attribute so the new code and tests can target it uniformly (monkeypatchable).
    • Call _resolve_service_url() at the top of every public method touching service_url: ping_data_server, list_datasets, list_files, save_file, list_saved_files, get_saved_file. get_file benefits transitively via list_datasets.
  • tests/test_datasets_lazy_resolution.py — six focused unit tests, no network, no data server required. Cover the lazy path, the explicit-None-stays-disabled invariant, the explicit-URL-doesn't-re-resolve invariant, and refresh() swapping URLs.

  • pyproject.toml: 0.9.10.9.3. 0.9.2 is reserved by PR fix: auto-reconnect Ray Client on stale Serve handle (0.9.2) #70 (Ray Client stale-handle reconnect, still open). The two PRs are independent; whichever lands first takes the lower version, the other rebases.

Out of scope

  • Per-call discovery-file polling has a steady-state cost of one filesystem stat per dataset operation (cheap; microseconds). I did not add an in-memory negative-cache because the operation count is bounded by what the deployment actually does, and a never-resolved client only retries when something would have queried datasets anyway.
  • No locking around the resolve path. The current BioEngineDatasets is single-event-loop; concurrent calls from within one loop cannot race in the sync resolve method. Adding a lock is straightforward if a future caller introduces threads.

Test plan

  • python -m pytest tests/test_datasets_lazy_resolution.py -v --noconftest — 6 passed.
  • CI: docker-publish.yml accepts 0.9.3 > 0.9.1 (latest GHCR) and builds the image.
  • Reproduce the original incident on a worker: deploy a BioEngineDatasets-using app with no data server, start the data server, call list_datasets() — should now return the populated dictionary without redeploying.

🤖 Generated with Claude Code

nilsmechtel and others added 2 commits May 18, 2026 16:52
BioEngineDatasets resolved the data server URL exactly once, at
construction time, by reading
~/.bioengine/datasets/bioengine_current_server. If the file did not
exist when the instance was created (typical for a Ray Serve deployment
that starts before the data server), service_url was set to None and
every subsequent dataset operation short-circuited — even after the
data server later wrote the file. The only escape was to recreate the
client, which a long-lived injected deployment cannot do.

Fix:

* Remember the user's intent (``"auto"`` / explicit URL / ``None``) on
  ``self._requested_data_server_url`` instead of collapsing it at init.
* Add ``_resolve_service_url()`` — a cheap "are we attached yet?" check
  that re-reads the discovery file the first time service_url is still
  None on any call. ``"auto"`` keeps re-trying until the file appears;
  explicit URLs are attached once; ``None`` stays disabled.
* Call ``_resolve_service_url()`` at the top of every public method that
  uses ``service_url`` (``ping_data_server``, ``list_datasets``,
  ``list_files``, ``save_file``, ``list_saved_files``, ``get_saved_file``;
  ``get_file`` benefits transitively via ``list_datasets``).
* Add ``async refresh()`` — explicit re-discovery for the second scenario
  the original code couldn't handle: the data server has restarted on a
  different URL. Closes the old httpx.AsyncClient before re-attaching.
* Move the discovery file path to ``_CURRENT_SERVER_FILE`` class
  attribute so the new code and tests can target it uniformly
  (monkeypatchable).

Adds tests/test_datasets_lazy_resolution.py — six focused unit tests
that exercise the lazy path without touching the network.

Bumps version to 0.9.3 (0.9.2 is reserved by PR #70 — the Ray Client
stale-handle fix). Both PRs are independent; whichever lands first
takes the lower version, the other rebases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nilsmechtel nilsmechtel merged commit 141f4c7 into main May 18, 2026
1 check passed
@nilsmechtel nilsmechtel deleted the feat/datasets-lazy-server-discovery branch May 18, 2026 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant