-
Notifications
You must be signed in to change notification settings - Fork 14
Read directly from remote paths, upload Dataset with reserve_manual_upload #1359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…o s3_remote_datasets
…o s3_remote_datasets
…set_upload_to_paths_for_preliminary
☂️ Python Coverage
Overall Coverage
New Files
Modified Files
|
…sos-libs into s3_remote_datasets
@@ -0,0 +1,32 @@ | |||
import webknossos as wk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import as wk or import directly. I wouldn't mix both.
@@ -0,0 +1,32 @@ | |||
import webknossos as wk | |||
from webknossos import LayerToLink | |||
from webknossos.dataset.remote_folder import RemoteFolder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from webknossos.dataset.remote_folder import RemoteFolder | |
from webknossos import RemoteFolder |
) | ||
copytree(attachment.path, new_path) | ||
self._add_attachment(new_attachment) | ||
if self._layer.path is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we define a bool for virtual/remote datasets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered making subclasses for remote layers?
_UNSET = make_sentinel("UNSET", var_name="_UNSET") | ||
|
||
|
||
class LayerToLink(NamedTuple): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separate file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would lead to cyclic imports as LayerToLink requires dataset and dataset requires LayerToLink
def announce_manual_upload( | ||
cls, | ||
def publish_to_preliminary_dataset( | ||
self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the naming a bit confusing. a) I wouldn't use publish
, because that is not really used in wklibs. Better upload
. b) Is this announcing a dataset upload or moving files after announcing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree the name could be better. What we do here is, announcing the dataset properties to wk, so that it can reserve the necessary paths. We get back the paths and upload/copy/symlink the data. Finally, we let wk now that the data is now available.
""" | ||
Iterates over the mags and attachments and copies or symlinks them to the target location. | ||
""" | ||
# This needs to be set to make sure the encoding is not chunked when uploading |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit surprising here. We can do this in the lib init.
…8844) ### Description This PR changes how clients with direct access to the underlying data storage can get datasets registered into WK. Corresponding libs client PR: scalableminds/webknossos-libs#1359 <details> <summary>New ReserveDatasetUploadToPaths Protocol</summary> - Client sends reserveDatasetUploadToPaths request to wk side, including a preliminary datasource plus layersToLink. - Server generates id and paths for all mags and attachments where the client must put the stuff. This is returned to the client. - Server also validates layersToLink, and creates a dataset in the database, marked as not yet fully uploaded to the target paths. - After the client has actually put the data there, it calls finishDatasetUploadToPath, marking the dataset as complete. It can now be used. - Note that layersToLink here are by datasetId+layerName, but for normal dataset upload it’s still datasetDirectoryName+layerName (we’ll change that to id later) </details> <details> <summary>New ReserveAttachmentUploadToPath Protocol</summary> - Client sends reserveAttachmentUploadToPath request with dataset id, attachment name, type etc. WK returns the path where the client must put the attachment. Wk inserts this path in the DB, marked as not yet fully uploaded - After the client has actually put the data there, it calls finishAttachmentUploadToPath with the same parameters (as there is no attachment id), marking the attachment as complete. It can now be used. </details> <details> <summary>New ReserveUploadToPathForPreliminary Protocol</summary> This extra route supports a new workflow, which will be used only by the convert_to_wkw worker job, which webknossos starts when a user uploads a dataset that is not yet in wkw/zarr format. When starting this upload, a dataset is already inserted in the database (via reserveUpload – not manual). Then after finishUpload, the conversion worker job is started. Now after the worker has converted the dataset to wkw/zarr, it needs to call this new reserveUploadToPathsForPreliminary route, which works only on datasets in this particular state. It returns paths, just like the normal, user-facing reserveUploadToPaths described above. But it is different in that it needs to operate on the datasetId of the already-created, preliminary dataset from the original reserveUpload. </details> <details> <summary>Changes to Editing DataSources</summary> - Dataset settings view no longer allows changing layer names - It also no longer allows editing the json directly (textarea is hidden for now, to be removed in follow-up) - The backend now only applies the allowed changes on save. (category, boundingBox, coordinatesTransformations, defaultViewConfiguration, adminViewConfiguration, largestSegmentId, voxelSize, deleted layers) </details> <details> <summary>Misc Changes</summary> - localDirectoryWhitelist is no longer checked in data loading, only in add/exploreAndAdd/upload - exploreAndAdd now returns dataset id - full dataSource now included in dataset publicWrites, with source data paths if requested by includePaths </details> <details> <summary>Refactoring: UPath</summary> - New class UPath handles paths that can be either remote (s3, https, gcs) or local (file or no schem) - Attachments and mags have this UPath class now instead of URI and String respectively - VaultPath also wraps a UPath (plus its DataVault with credentials) </details> <details> <summary>Refactoring: New DataLayer Class Hierarchy</summary> - Simplified DataSource and DataLayer classes. <img width="800" alt="image" src="https://github.com/user-attachments/assets/4de47da2-ffae-4149-8e70-4653ca86d630" /> </details> ### Steps to test: <details> <summary>Test against regressions</summary> - Browse some data, annotate some, explore remote dataset, explore local dataset in directoryWhitelist, should all still work - Test that both the newest and also older libs versions can still open_remote and list existing datasets (including ones with unusual data, like remote N5) </details> <details> <summary>Test new features</summary> - With new libs client code, go through the reserveUploadToPaths flow scalableminds/webknossos-libs#1359 - Also test reserveAttachmentUploadToPath via libs - Test that libs can still upload a dataset. - Attempt to change paths of an existing dataset in wk, or add a dataset with paths outside of itself, should not be possible (security!) - Change the application.conf `webKnossos.datasets.uploadToPathsPrefixes` to our managed s3 object storage, give both wk and the client credentials, should also work with this. </details> ### TODOs: <details> <summary>Backend</summary> - [x] Decide how the directory name should look: just the id? or prefix-id? (might get outdated on renamings) - [x] is the reserve response the full datasource again? or just a path prefix? - [x] how to configure this in the dev setup - [x] clean up the whole DataSource mess? - [x] ensure we include resolutions; - [x] never leak credentialIds (except during explore/add flow); - [x] settings view reads InboxDataSource, which includes credentialIds. - [x] Needs new Api for changing relevant fields - [x] compact json version? (resolutions, no paths?) - [x] do not write resolutions to disk? - [x] directory scan overwrites status to No longer available on the datastore - [x] reserveAttachmentUploadToPaths - [x] what’s up with the zarr streaming snapshots? - [x] evolutions - [x] does upload still check unique directory name? - [x] ensure that add route still checks localDirectoryWhiteList (also for exploreAndAdd) - [x] move stuff to DatasetService, make createDataset private again - [x] re-test upload with layersToLink from libs - [x] return readable failures on now-unsupported routes? (old datastore-based reserveManualUpload) - [x] do not require status field in datasource in reserveUpload request body. - [x] introduce api version 11 - [x] re-test that editing DS in DB + reload propagates to datastore - [x] on finishAttachmentUploadToPath, update datasource on disk too if isVirtual=False. - [x] datasource update call via wk backend instead of frontend->datastore? skip datastore if isVirtual=true - [x] more unit tests for UPath - [x] double check that dataset info route is backwards compatible for libs - [x] add remote UI layer name no longer editable, but explore returns empty, that’s a problem. test with first entry from notion remote datasets table. </details> <details> <summary>Frontend</summary> - [x] dataset settings view: include paths, remove advanced view - [x] updating datasource should go via wk first </details> ### Follow-Ups <details> <summary>Possible Follow-Up Issues</summary> - #8939 - #8940 - #8941 - #8942 - #8943 - #8944 - #8945 - #8946 </details> ### Issues: - fixes #8827 - fixes #8859 - fixes #8857 - fixes #8851 - fixes #8762 - fixes #8881 - fixes #8929 - contributes to #8749 ------ - [x] Added changelog entry (create a `$PR_NUMBER.md` file in `unreleased_changes` or use `./tools/create-changelog-entry.py`) - [x] Added migration guide entry if applicable (edit the same file as for the changelog) - [ ] Adapted [wk-libs python client](https://github.com/scalableminds/webknossos-libs/tree/master/webknossos/webknossos/client) if relevant API parts change scalableminds/webknossos-libs#1359 - [x] Removed dev-only changes like prints and application.conf edits - [x] Considered [common edge cases](../blob/master/.github/common_edge_cases.md) - [x] Needs datastore update after deployment --------- Co-authored-by: valentin-pinkau <valentin.pinkau@scalableminds.com>
This reverts commit 3fed75a.
Description:
Issues:
Todos:
Make sure to delete unnecessary points or to check all before merging: