-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate dataset #148
Validate dataset #148
Conversation
src/scitacean/client.py
Outdated
finalized_model = self.scicat.create_dataset_model( | ||
dataset.make_upload_model() | ||
) | ||
finalized_model = self.scicat.create_dataset_model(dset_model) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is technically wrong. dataset
got changed above to include the uploaded file info. So you need to generate a new model here. The only difference this makes is that the size may change. Because differences in the filesystem of remote and local can lead to different file sizes. And the old code used the remote size, but the new code uses the local size.
This doesn't get caught by the tests because they only run on one filesystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Do we still want to validate before self._connect_for_file_upload(dataset)
as we are now, or can we defer validation to here so that we only make the model once?
I am guessing making the model is cheap enough that this is probably not an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's cheap compared to http requests. So I wouldn't worry.
You need to make the model for upload after the file upload. So it would then defer validation until then which is the opposite of what we want.
operation="validate_dataset_model", | ||
) | ||
if not response["valid"]: | ||
raise ValueError(f"Dataset {dset} did not pass validation in SciCat.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does SciCat return details about what fields failed validation? It does when you try to upload. So it would be good to show those here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not, we checked but it only returns True
or False
. Yes, extra info on what part of the validation failed would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that a feature we should request from Scicat?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly. To be honest, I'm hoping to get them to implement a transaction feature. Then we might not even need this extra validation step. I'm thinking through how that could work and will open an issue with a lot of details eventually.
Let's leave it as is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a transaction feature in this situation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A way to make either all uploads (dataset, datablocks, attachments, files) succeed or all fail so we don't end up with partially uploaded data. create_new_dataset_now
attempts to work in this way but is limited because of the SciCat API. I'm hoping to get a feature that lets us do this better. I's too complicated to explain here, though.
tests/client/dataset_client_test.py
Outdated
@@ -65,6 +65,13 @@ def test_create_dataset_model(scicat_client, derived_dataset): | |||
assert expected == dict(downloaded)[key], f"key = {key}" | |||
|
|||
|
|||
def test_validate_dataset_model(real_client, require_scicat_backend, derived_dataset): | |||
real_client.scicat.validate_dataset_model(derived_dataset) | |||
derived_dataset.type = "banana" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use a different field. E.g., set the contactEmail
to something that is not an email address.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a test of upload_new_dataset_now
that makes use of a disabled validation function in the fake client?
There should already be a test for a failed validation, but that checks for failure in the __init__
of the model.
operation="validate_dataset_model", | ||
) | ||
if not response["valid"]: | ||
raise ValueError(f"Dataset {dset} did not pass validation in SciCat.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change this to a pydantic.ValidationError
to be in line with the local validation? Or is there a reason to use a different type?
…use contactEmail instead of type in test
Fixes #59.
Add method to the scicat client that does validation on the SciCat server.
Call it before uploading dataset.