fix: BioEngineDatasets lazily re-resolves data server URL (0.9.3) by nilsmechtel · Pull Request #71 · aicell-lab/bioengine

nilsmechtel · 2026-05-18T14:52:47Z

Summary

Resolves the reported issue where BioEngineDatasets permanently locks in service_url = None if the discovery file ~/.bioengine/datasets/bioengine_current_server is missing at construction time, and never re-checks even after the data server starts and writes the file. Long-lived injected clients in Ray Serve deployments (e.g. chiron-manager) could not see datasets that became available after deploy.

Approach

The issue suggested two options: lazy re-read in list_datasets(), or an explicit refresh() method. I implemented both because they cover different scenarios:

Lazy resolution (_resolve_service_url) handles the common case the issue describes: the client was constructed before the data server. No caller change required. Every public method now calls _resolve_service_url() first; the fast path is a single attribute check, the slow path reads one small file.
Explicit refresh() handles the second-class case the lazy path can't: the data server has restarted on a new URL. The lazy path doesn't fire while service_url is already set, so it can't detect a server change. refresh() closes the old httpx.AsyncClient and re-runs the resolution.

Changes

bioengine/datasets/datasets.py
- Store the user's request verbatim on self._requested_data_server_url ("auto" / explicit URL / None) instead of collapsing it at init.
- Add _resolve_service_url(*, initial=False) — lazy attach. Re-read the discovery file when service_url is None and the request was "auto". Initial-call warning only at construction so a never-attached client doesn't spam logs.
- Add async refresh() — close old client, clear state, re-resolve. Returns the new URL or None.
- Move the discovery file path to _CURRENT_SERVER_FILE class attribute so the new code and tests can target it uniformly (monkeypatchable).
- Call _resolve_service_url() at the top of every public method touching service_url: ping_data_server, list_datasets, list_files, save_file, list_saved_files, get_saved_file. get_file benefits transitively via list_datasets.
tests/test_datasets_lazy_resolution.py — six focused unit tests, no network, no data server required. Cover the lazy path, the explicit-None-stays-disabled invariant, the explicit-URL-doesn't-re-resolve invariant, and refresh() swapping URLs.
pyproject.toml: 0.9.1 → 0.9.3. 0.9.2 is reserved by PR fix: auto-reconnect Ray Client on stale Serve handle (0.9.2) #70 (Ray Client stale-handle reconnect, still open). The two PRs are independent; whichever lands first takes the lower version, the other rebases.

Out of scope

Per-call discovery-file polling has a steady-state cost of one filesystem stat per dataset operation (cheap; microseconds). I did not add an in-memory negative-cache because the operation count is bounded by what the deployment actually does, and a never-resolved client only retries when something would have queried datasets anyway.
No locking around the resolve path. The current BioEngineDatasets is single-event-loop; concurrent calls from within one loop cannot race in the sync resolve method. Adding a lock is straightforward if a future caller introduces threads.

Test plan

python -m pytest tests/test_datasets_lazy_resolution.py -v --noconftest — 6 passed.
CI: docker-publish.yml accepts 0.9.3 > 0.9.1 (latest GHCR) and builds the image.
Reproduce the original incident on a worker: deploy a BioEngineDatasets-using app with no data server, start the data server, call list_datasets() — should now return the populated dictionary without redeploying.

🤖 Generated with Claude Code

BioEngineDatasets resolved the data server URL exactly once, at construction time, by reading ~/.bioengine/datasets/bioengine_current_server. If the file did not exist when the instance was created (typical for a Ray Serve deployment that starts before the data server), service_url was set to None and every subsequent dataset operation short-circuited — even after the data server later wrote the file. The only escape was to recreate the client, which a long-lived injected deployment cannot do. Fix: * Remember the user's intent (``"auto"`` / explicit URL / ``None``) on ``self._requested_data_server_url`` instead of collapsing it at init. * Add ``_resolve_service_url()`` — a cheap "are we attached yet?" check that re-reads the discovery file the first time service_url is still None on any call. ``"auto"`` keeps re-trying until the file appears; explicit URLs are attached once; ``None`` stays disabled. * Call ``_resolve_service_url()`` at the top of every public method that uses ``service_url`` (``ping_data_server``, ``list_datasets``, ``list_files``, ``save_file``, ``list_saved_files``, ``get_saved_file``; ``get_file`` benefits transitively via ``list_datasets``). * Add ``async refresh()`` — explicit re-discovery for the second scenario the original code couldn't handle: the data server has restarted on a different URL. Closes the old httpx.AsyncClient before re-attaching. * Move the discovery file path to ``_CURRENT_SERVER_FILE`` class attribute so the new code and tests can target it uniformly (monkeypatchable). Adds tests/test_datasets_lazy_resolution.py — six focused unit tests that exercise the lazy path without touching the network. Bumps version to 0.9.3 (0.9.2 is reserved by PR #70 — the Ray Client stale-handle fix). Both PRs are independent; whichever lands first takes the lower version, the other rebases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nilsmechtel and others added 2 commits May 18, 2026 16:52

Merge branch 'main' into feat/datasets-lazy-server-discovery

9fc91b2

nilsmechtel merged commit 141f4c7 into main May 18, 2026
1 check passed

nilsmechtel deleted the feat/datasets-lazy-server-discovery branch May 18, 2026 15:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: BioEngineDatasets lazily re-resolves data server URL (0.9.3)#71

fix: BioEngineDatasets lazily re-resolves data server URL (0.9.3)#71
nilsmechtel merged 2 commits into
mainfrom
feat/datasets-lazy-server-discovery

nilsmechtel commented May 18, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nilsmechtel commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Approach

Changes

Out of scope

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

nilsmechtel commented May 18, 2026 •

edited

Loading