Virtual Datasets #8708
Conversation
📝 Walkthrough
This change introduces "virtual datasets" by shifting dataset identification and management from organization/directory-based keys to a unified dataset ID.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~90+ minutes
Actionable comments posted: 0
♻️ Duplicate comments (2)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (1)
235-247: Caching implementation follows best practices
The caching implementation with a 5-minute TTL is appropriate for dataset ID lookups.
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)
694-716: Cache invalidation properly implemented
The method correctly invalidates the cache after updating the remote data source (line 711), addressing the previous review comment about stale cache issues.
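As a rough illustration of the pattern praised in these two comments, here is a minimal, self-contained TTL-cache sketch. The real code uses the codebase's own caching utility; `TtlCache`, `getOrInsert`, and `invalidate` are hypothetical names for this sketch only:

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for the datastore's cache utility; not the actual API.
final class TtlCache[K, V](ttlMillis: Long) {
  private val entries = TrieMap.empty[K, (V, Long)] // value + insertion timestamp

  def getOrInsert(key: K)(compute: => V): V = {
    val now = System.currentTimeMillis()
    entries.get(key) match {
      case Some((value, insertedAt)) if now - insertedAt < ttlMillis => value // fresh hit
      case _ =>
        val value = compute // miss or expired entry: recompute and store
        entries.put(key, (value, now))
        value
    }
  }

  // Explicit invalidation after updates, as the second comment above describes.
  def invalidate(key: K): Unit = entries.remove(key)
}

// Usage: cache dataset-id lookups for 5 minutes.
val datasetIdCache = new TtlCache[String, String](5 * 60 * 1000L)
```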
🧹 Nitpick comments (3)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (1)
144-156: Consider adding response validation
The method extracts `response.body` directly without validating whether it is a valid dataset ID format. Consider parsing it as an `ObjectId` to ensure validity.

```diff
- datasetId = response.body
+ datasetId <- ObjectId.fromString(response.body).toFox ?~> "Invalid dataset ID in response"
```

webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/LegacyController.scala (1)
457-499: Consider adding cache invalidation after disk load
In the `loadFromDisk` function, after successfully updating the data source repository, the dataset cache should be invalidated to ensure subsequent operations get the fresh data.

```diff
 case GenericDataSource(_, _, _, _) =>
   for {
     _ <- dataSourceRepository.updateDataSource(dataSource)
+    _ = datasetCache.invalidateCache(datasetId)
   } yield Ok(Json.toJson(dataSource))
```
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)
371-381: Consider more specific error messages
The generic error message "dataset.add.failed" could be more specific to help users understand what went wrong during registration.

```diff
- datasetId <- dsRemoteWebknossosClient.registerDataSource(dataSource, dataSourceId, folderId) ?~> "dataset.add.failed"
+ datasetId <- dsRemoteWebknossosClient.registerDataSource(dataSource, dataSourceId, folderId) ?~> "dataset.add.failed.registration"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (18 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/LegacyController.scala (1 hunk)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (5 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (1)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in the `updateMags` method of DatasetMagsDAO (Scala), the code handles different dataset types distinctly
Learnt from: frcroth
PR: scalableminds/webknossos#8609
File: app/models/dataset/Dataset.scala:753-775
Timestamp: 2025-05-12T13:07:29.637Z
Learning: In the `updateMags` method of DatasetMagsDAO (Scala), the code handles different dataset types distinctly:
1. Non-WKW datasets have `magsOpt` populated and use the first branch which includes axisOrder, channelIndex, and credentialId.
2. WKW datasets will have `wkwResolutionsOpt` populated and use the second branch which includes cubeLength.
3. The final branch is a fallback for legacy data.
This ensures appropriate fields are populated for each dataset type.
Applied to files:
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/LegacyController.scala
📚 Learning: in Scala's for-comprehension with Fox (Future-like type), the `<-` operator ensures sequential execution
Learnt from: MichaelBuessemeyer
PR: scalableminds/webknossos#8352
File: app/models/organization/CreditTransactionService.scala:0-0
Timestamp: 2025-01-27T12:06:42.865Z
Learning: In Scala's for-comprehension with Fox (Future-like type), the `<-` operator ensures sequential execution. If any step fails, the entire chain short-circuits and returns early, preventing subsequent operations from executing. This makes it safe to perform validation checks before database operations.
Applied to files:
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala
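To make the short-circuiting described in this learning concrete, here is a minimal stand-in using `Either` instead of the internal `Fox` type; the sequencing behaviour in a for-comprehension is analogous, and the function names are made up:

```scala
// Each step either succeeds (Right) or fails (Left); `<-` sequences them.
def validateAmount(amount: Long): Either[String, Long] =
  if (amount > 0) Right(amount) else Left("credit.amount.invalid")

def persistTransaction(amount: Long): Either[String, Long] =
  Right(amount) // imagine a database write here

val result: Either[String, Long] = for {
  checked <- validateAmount(-5)          // fails here ...
  saved   <- persistTransaction(checked) // ... so this step never runs
} yield saved
// result == Left("credit.amount.invalid")
```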
📚 Learning: in the webknossos codebase, classes extending `FoxImplicits` have access to an implicit conversion from `Option[A]` to `Fox[A]`
Learnt from: frcroth
PR: scalableminds/webknossos#8236
File: webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/mesh/MeshFileService.scala:170-173
Timestamp: 2025-04-23T08:51:57.756Z
Learning: In the webknossos codebase, classes extending `FoxImplicits` have access to an implicit conversion from `Option[A]` to `Fox[A]`, where `None` is converted to an empty Fox that fails gracefully in for-comprehensions.
Applied to files:
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala
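Again with `Either` standing in for `Fox`, the `Option`-to-failable conversion this learning describes looks roughly like the following; `findChunk`, `MeshChunk`, and the error key are invented for the sketch:

```scala
final case class MeshChunk(data: Vector[Int])

// None is converted into a graceful failure, mirroring Option[A] -> Fox[A].
def findChunk(chunks: Map[String, MeshChunk], key: String): Either[String, MeshChunk] =
  chunks.get(key).toRight(s"mesh.chunk.notFound: $key")

val chunks  = Map("a" -> MeshChunk(Vector(1, 2)))
val found   = findChunk(chunks, "a") // Right(MeshChunk(Vector(1, 2)))
val missing = findChunk(chunks, "b") // Left("mesh.chunk.notFound: b")
```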
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: build-smoketest-push
- GitHub Check: backend-tests
- GitHub Check: frontend-tests
🔇 Additional comments (6)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (2)
53-57: LGTM!
The `DataSourceRegistrationInfo` case class is well-structured and properly implements JSON serialization.
130-133: LGTM!
The migration to use `ObjectId` for dataset identification is correctly implemented.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/LegacyController.scala (1)
66-78: Legacy route implementation pattern is consistent and secure
The implementation correctly validates access using the legacy DataSourceId before converting to the new ObjectId format. The error handling with Fox ensures proper propagation of failures.
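A self-contained sketch of the validate-then-convert pattern this comment describes; all names, signatures, and stub bodies here are illustrative, not the actual webknossos API:

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Illustrative stand-ins for the real types.
final case class DataSourceId(directoryName: String, organizationId: String)
final case class ObjectId(id: String)

def validateAccess(id: DataSourceId): Future[Unit] = Future.unit // stubbed auth check
def lookupDatasetId(id: DataSourceId): Future[ObjectId] =
  Future.successful(ObjectId("64f0a9b1c2d3e4f5a6b7c8d9"))         // stubbed wk lookup
def handleById(datasetId: ObjectId): Future[String] =
  Future.successful(s"ok: ${datasetId.id}")                       // stubbed id-based route

// The pattern: authorize against the legacy key first, then resolve the
// unified dataset id and delegate to the id-based implementation.
def legacyRoute(organizationId: String, directoryName: String): Future[String] = {
  val legacyId = DataSourceId(directoryName, organizationId)
  for {
    _      <- validateAccess(legacyId)
    newId  <- lookupDatasetId(legacyId)
    result <- handleById(newId)
  } yield result
}
```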
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (3)
353-367: Update method correctly handles both disk and virtual datasets
The implementation appropriately checks whether the dataset exists on disk before deciding whether to update locally or remotely. Preserving the original datasource ID ensures consistency.
412-430: Cache clearing refactored efficiently
The method now accepts an `InboxDataSource` parameter, avoiding redundant disk reads. All relevant caches are properly cleared.
443-463: Delete operation correctly handles virtual datasets
The method appropriately distinguishes between disk-based and virtual datasets, using the new `deleteVirtualDataset` API for datasets that don't exist on disk.
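The disk-vs-virtual branching that both the update (353-367) and delete (443-463) comments refer to boils down to something like the following sketch; `existsOnDisk` and the per-operation handlers are hypothetical placeholders, not the real method names:

```scala
import scala.concurrent.Future

final case class DatasetId(id: String)

// Stubs: a virtual dataset has no directory on disk.
def existsOnDisk(id: DatasetId): Boolean = false
def updateOnDisk(id: DatasetId): Future[Unit] = Future.unit   // rewrite datasource-properties.json
def updateRemotely(id: DatasetId): Future[Unit] = Future.unit // persist via webknossos
def deleteOnDisk(id: DatasetId): Future[Unit] = Future.unit   // move the directory to trash
def deleteVirtual(id: DatasetId): Future[Unit] = Future.unit  // e.g. the new deleteVirtualDataset call

def update(id: DatasetId): Future[Unit] =
  if (existsOnDisk(id)) updateOnDisk(id) else updateRemotely(id)

def delete(id: DatasetId): Future[Unit] =
  if (existsOnDisk(id)) deleteOnDisk(id) else deleteVirtual(id)
```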
Let’s go!
#8708 failed to update the frontend part of the `positionForSegment` route; I checked this in the changes of #8708, and this PR should fix it. Furthermore, I investigated some other potential routes and did not find any others that were broken. However, I did not do a full search, just some other related routes in `datastore.latest.routes`.

### URL of deployed dev instance (used for testing):
- https://___.webknossos.xyz

### Steps to test:
- Not sure how to test this reliably. I triggered the bug locally by splitting two segments that had no connecting surface in the agglomerate mesh. This led to a crash of wk, and when I investigated, it showed that the route fixed here was causing a 404 on the current master.

### Issues:
- self-noticed bug

------

(Please delete unneeded items, merge only when none are left open)

- [x] Added changelog entry (create a `$PR_NUMBER.md` file in `unreleased_changes` or use `./tools/create-changelog-entry.py`)
- The “scan disk for new datasets” button now only scans for the current orga. This is mostly a performance optimization for multi-orga setups
- Removed the now obsolete stateful dataSourceRepository (the wk-side database is the source of truth; the datastore side should use only the data from there, with its cache)
  - Remaining usages now either talk to wk directly or, in the case of STL download in LegacyController, use the original controller’s implementation.
- The dashboard search now supports searching for dataset ids (only active when a full ObjectId is entered in the search field)
- Fixed a bug where datasets with AdditionalAxes would be assumed changed on every re-report due to the hashCode being non-deterministic. This is fixed by using Seq instead of Array for the bounds of AdditionalAxis (see the sketch below).

### Steps to test:
- Create setup with multiple orgas (e.g. with isWkOrgInstance=true)
- Put datasets in multiple orgas, hit the refresh button. It should scan only those of the user’s current orga.
- The regular once-a-minute scan should still scan everything.
- Try searching for a dataset id in the dashboard

### TODOs:
- [x] Backend
  - [x] Take optional orga id parameter, scan only that directory
  - [x] Don’t unreport datasets of other orgas
  - [x] adapt to the changes of #8708
- [x] Frontend
  - [x] Adapt function in rest_api.ts
  - [x] Pass current organizationId when hitting the button

### Issues:
- fixes #8784

------

- [x] Added changelog entry (create a `$PR_NUMBER.md` file in `unreleased_changes` or use `./tools/create-changelog-entry.py`)
- [x] Removed dev-only changes like prints and application.conf edits
- [x] Considered [common edge cases](../blob/master/.github/common_edge_cases.md)
- [x] Needs datastore update after deployment

---------

Co-authored-by: Charlie Meister <charlie.meister@student.hpi.de>
Co-authored-by: frcroth <frcroth@users.noreply.github.com>
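The Array-vs-Seq fix mentioned in the list above comes down to a JVM detail: Scala `Array`s are Java arrays, whose `hashCode` is identity-based, so a case class holding one hashes differently on every construction, while `Seq` hashes structurally. A runnable sketch (the class names are made up for illustration):

```scala
final case class BoundsWithArray(bounds: Array[Int])
final case class BoundsWithSeq(bounds: Seq[Int])

object HashDemo extends App {
  val a1 = BoundsWithArray(Array(0, 10))
  val a2 = BoundsWithArray(Array(0, 10))
  println(a1.hashCode == a2.hashCode) // almost always false: identity hash of the arrays

  val s1 = BoundsWithSeq(Seq(0, 10))
  val s2 = BoundsWithSeq(Seq(0, 10))
  println(s1.hashCode == s2.hashCode) // true: structural hash, stable across re-reports
}
```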
With #8708 the datastore uses the datasource properties as stored in postgres. Since wk already scans for attachments not registered in the on-disk datasource-properties.json and writes those into postgres, we don’t need to re-scan when answering the attachment list routes. There we can now rely on the attachment being already registered.

### Steps to test:
- Open a local dataset with hdf5 attachments that are not registered in the local datasource-properties.json
- They should still be usable.

### Issues:
- contributes to #8567

------

- [x] Considered [common edge cases](../blob/master/.github/common_edge_cases.md)
- [x] Needs datastore update after deployment
This change was accidentally omitted in #8708.
Follow-up fix for #8708 (request must be by id now)

### URL of deployed dev instance (used for testing):
- https://fixadhocviewmode.webknossos.xyz

### Steps to test:
- Open a dataset with a static segmentation layer, request an ad-hoc mesh for a segment; it should load.

### Issues:
- fixes https://scm.slack.com/archives/C02H5T8Q08P/p1757351922621309?thread_ts=1757338710.530969&cid=C02H5T8Q08P

------

- [x] Added changelog entry (create a `$PR_NUMBER.md` file in `unreleased_changes` or use `./tools/create-changelog-entry.py`)
- [x] Considered [common edge cases](../blob/master/.github/common_edge_cases.md)
URL of deployed dev instance (used for testing):
Steps to test:
Implementation notes
- DataSource Id is still used internally in the datastore for various caches, and also in the binary data handling.
- Handling of real disk data sources is still done via orgId and datasetDirectoryName (e.g., uploading, storage size).
- Everything else should use dataset ids (see the sketch below).
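A sketch of the identifier split described in these notes; the field names and path layout are assumptions for illustration, not the exact webknossos definitions:

```scala
import java.nio.file.{Path, Paths}

// Legacy, disk-oriented key: identifies a directory on the datastore.
final case class DataSourceId(organizationId: String, directoryName: String)
// Unified dataset id used everywhere else.
final case class DatasetId(id: String)

// Disk-bound operations (uploading, storage size) still need the physical location:
def datasetPath(baseDir: Path, id: DataSourceId): Path =
  baseDir.resolve(id.organizationId).resolve(id.directoryName)

// Everything else addresses the dataset by its id only, e.g. cache keys:
def cacheKey(id: DatasetId): String = s"dataset/${id.id}"

// Usage:
val p = datasetPath(Paths.get("/srv/binaryData"), DataSourceId("sample_organization", "l4_sample"))
```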
TODOs:
Issues:
Followups:
(Please delete unneeded items, merge only when none are left open)
Added changelog entry (create a `$PR_NUMBER.md` file in `unreleased_changes` or use `./tools/create-changelog-entry.py`)