Validate dataset #148

Merged (10 commits) on Oct 2, 2023

Conversation

jokasimr (Collaborator):

Fixes #59.

Add a method to the SciCat client that performs validation on the SciCat server. Call it before uploading the dataset.

Diff under review:

-    finalized_model = self.scicat.create_dataset_model(
-        dataset.make_upload_model()
-    )
+    finalized_model = self.scicat.create_dataset_model(dset_model)
Collaborator:

This is technically wrong: dataset was changed above to include the uploaded file info, so you need to generate a new model here. In practice the only difference is the file size, because the remote and local filesystems can report different sizes. The old code used the remote size; the new code uses the local size.

This doesn't get caught by the tests because they only run on one filesystem.
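The issue can be demonstrated with a minimal sketch. This is NOT the real scitacean API; the stub classes below only borrow the name `make_upload_model` from the snippet above, under the assumption that the model snapshots the file sizes known at call time:

```python
# Minimal sketch (not the real scitacean API): stubs illustrating why the
# upload model must be rebuilt after the file upload.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UploadModel:
    size: int


@dataclass
class Dataset:
    files: Dict[str, int] = field(default_factory=dict)  # file name -> size in bytes

    def make_upload_model(self) -> UploadModel:
        # The model captures the sizes as they are *right now*.
        return UploadModel(size=sum(self.files.values()))


dataset = Dataset(files={"data.nxs": 100})
early_model = dataset.make_upload_model()  # built before the file upload

# The file upload replaces the local size with the size reported by the
# remote filesystem, which can differ from the local one.
dataset.files["data.nxs"] = 104

late_model = dataset.make_upload_model()  # built after the file upload
# Reusing early_model at this point would silently record the stale local size.
```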

Collaborator:

Ok. Do we still want to validate before self._connect_for_file_upload(dataset) as we are now, or can we defer validation to here so that we only make the model once?

I am guessing making the model is cheap enough that this is probably not an issue?

Collaborator:

It's cheap compared to HTTP requests, so I wouldn't worry.

The upload model has to be built after the file upload, so deferring to here would also defer validation until then, which is the opposite of what we want.

            operation="validate_dataset_model",
        )
        if not response["valid"]:
            raise ValueError(f"Dataset {dset} did not pass validation in SciCat.")
Collaborator:

Does SciCat return details about what fields failed validation? It does when you try to upload. So it would be good to show those here as well.

Collaborator:

It does not; we checked, and it only returns True or False. Extra info on which part of the validation failed would be nice, though.

Collaborator:

Ok

Collaborator:

Is that a feature we should request from SciCat?

Collaborator:

Possibly. To be honest, I'm hoping to get them to implement a transaction feature. Then we might not even need this extra validation step. I'm thinking through how that could work and will open an issue with a lot of details eventually.

Let's leave it as is for now.

Collaborator (author):

What is a transaction feature in this situation?

Collaborator:

A way to make all uploads (dataset, datablocks, attachments, files) either all succeed or all fail, so we don't end up with partially uploaded data. create_new_dataset_now attempts to work this way but is limited by the SciCat API. I'm hoping to get a feature that lets us do this better. It's too complicated to explain here, though.
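The all-or-nothing behaviour described above can be sketched as a rollback pattern. All names here are invented; SciCat offers no such transaction API today, so this only illustrates the idea:

```python
# Illustrative sketch of the "transaction" idea: run the upload steps in
# order and, if any step fails, undo everything done so far instead of
# leaving partially uploaded data behind.
class UploadTransaction:
    def __init__(self):
        self.committed = []
        self._undo_stack = []

    def run(self, steps):
        # steps: list of (name, action, undo) where action() performs an
        # upload and undo() deletes whatever action() created.
        try:
            for name, action, undo in steps:
                action()
                self.committed.append(name)
                self._undo_stack.append(undo)
        except Exception:
            for undo in reversed(self._undo_stack):
                undo()  # roll back, newest first
            self.committed.clear()
            raise


def fail():
    raise RuntimeError("attachment upload rejected")


uploaded = []
tx = UploadTransaction()
try:
    tx.run([
        ("dataset", lambda: uploaded.append("dataset"),
         lambda: uploaded.remove("dataset")),
        ("files", lambda: uploaded.append("files"),
         lambda: uploaded.remove("files")),
        ("attachments", fail, lambda: None),
    ])
except RuntimeError:
    pass
# After the third step fails, the first two uploads are rolled back.
```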

@@ -65,6 +65,13 @@ def test_create_dataset_model(scicat_client, derived_dataset):
        assert expected == dict(downloaded)[key], f"key = {key}"


def test_validate_dataset_model(real_client, require_scicat_backend, derived_dataset):
    real_client.scicat.validate_dataset_model(derived_dataset)
    derived_dataset.type = "banana"
Collaborator:

Please use a different field. E.g., set the contactEmail to something that is not an email address.
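The suggested test shape might look like the sketch below. The fake client is a stand-in for the real_client fixture so the example is self-contained; the point of contactEmail is that the bad value is still a well-typed string, so rejection comes from server-side validation rather than from a wrong type:

```python
# Sketch of the suggested test (stand-ins only, not the real fixtures).
class FakeScicat:
    def validate_dataset_model(self, dset: dict) -> None:
        # Crude stand-in for SciCat's server-side email check.
        if "@" not in dset.get("contactEmail", ""):
            raise ValueError(f"Dataset {dset} did not pass validation in SciCat.")


def test_validate_dataset_model_rejects_bad_email():
    scicat = FakeScicat()
    dataset = {"contactEmail": "owner@ess.eu"}
    scicat.validate_dataset_model(dataset)  # a well-formed dataset passes
    dataset["contactEmail"] = "not-an-email"  # type-correct but semantically invalid
    try:
        scicat.validate_dataset_model(dataset)
    except ValueError:
        return  # expected: validation rejected the email
    raise AssertionError("validation unexpectedly passed")


test_validate_dataset_model_rejects_bad_email()
```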

Collaborator:

Can you also add a test of upload_new_dataset_now that makes use of a disabled validation function in the fake client?
There should already be a test for a failed validation, but that checks for failure in the __init__ of the model.

            operation="validate_dataset_model",
        )
        if not response["valid"]:
            raise ValueError(f"Dataset {dset} did not pass validation in SciCat.")
Collaborator:

Can you change this to a pydantic.ValidationError to be in line with the local validation? Or is there a reason to use a different type?

@jl-wynen added the sprint-scipp-2023-09 label (Sprint of the Scipp team) on Sep 29, 2023
@jl-wynen merged commit 88b3318 into main on Oct 2, 2023 (12 checks passed)
@jl-wynen deleted the validate-dataset branch on Oct 2, 2023, 07:31
Linked issue: Validate dataset with backend before upload