Validate dataset #148

Merged (10 commits) on Oct 2, 2023

Conversation

jokasimr (Collaborator):

Fixes #59.

Add a method to the SciCat client that performs validation on the SciCat server. Call it before uploading the dataset.

Diff under review:

-    finalized_model = self.scicat.create_dataset_model(
-        dataset.make_upload_model()
-    )
+    finalized_model = self.scicat.create_dataset_model(dset_model)
Collaborator:

This is technically wrong: dataset was changed above to include the uploaded file info, so you need to generate a new model here. In practice the only difference is the file size, because the remote and local filesystems can report different sizes. The old code used the remote size; the new code uses the local size.

This doesn't get caught by the tests because they only run on one filesystem.
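The issue can be demonstrated with a minimal sketch. This is NOT the real scitacean API; the stub classes below only borrow the name `make_upload_model` from the snippet above, under the assumption that the model snapshots the file sizes known at call time:

```python
# Minimal sketch (not the real scitacean API): stubs illustrating why the
# upload model must be rebuilt after the file upload.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UploadModel:
    size: int


@dataclass
class Dataset:
    files: Dict[str, int] = field(default_factory=dict)  # file name -> size in bytes

    def make_upload_model(self) -> UploadModel:
        # The model captures the sizes as they are *right now*.
        return UploadModel(size=sum(self.files.values()))


dataset = Dataset(files={"data.nxs": 100})
early_model = dataset.make_upload_model()  # built before the file upload

# The file upload replaces the local size with the size reported by the
# remote filesystem, which can differ from the local one.
dataset.files["data.nxs"] = 104

late_model = dataset.make_upload_model()  # built after the file upload
# Reusing early_model at this point would silently record the stale local size.
```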

Collaborator:

Ok. Do we still want to validate before self._connect_for_file_upload(dataset) as we are now, or can we defer validation to here so that we only make the model once?

I am guessing making the model is cheap enough that this is probably not an issue?

Collaborator:

It's cheap compared to HTTP requests, so I wouldn't worry.

The upload model has to be built after the file upload, so deferring to here would also defer validation until then, which is the opposite of what we want.

            operation="validate_dataset_model",
        )
        if not response["valid"]:
            raise ValueError(f"Dataset {dset} did not pass validation in SciCat.")
Collaborator:

Does SciCat return details about what fields failed validation? It does when you try to upload. So it would be good to show those here as well.

Collaborator:

It does not; we checked, and it only returns True or False. Extra info on which part of the validation failed would be nice, though.

Collaborator:

Ok

Collaborator:

Is that a feature we should request from SciCat?

Collaborator:

Possibly. To be honest, I'm hoping to get them to implement a transaction feature. Then we might not even need this extra validation step. I'm thinking through how that could work and will open an issue with a lot of details eventually.

Let's leave it as is for now.

Collaborator (author):

What is a transaction feature in this situation?

Collaborator:

A way to make all uploads (dataset, datablocks, attachments, files) either all succeed or all fail, so we don't end up with partially uploaded data. create_new_dataset_now attempts to work this way but is limited by the SciCat API. I'm hoping to get a feature that lets us do this better. It's too complicated to explain here, though.
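The all-or-nothing behaviour described above can be sketched as a rollback pattern. All names here are invented; SciCat offers no such transaction API today, so this only illustrates the idea:

```python
# Illustrative sketch of the "transaction" idea: run the upload steps in
# order and, if any step fails, undo everything done so far instead of
# leaving partially uploaded data behind.
class UploadTransaction:
    def __init__(self):
        self.committed = []
        self._undo_stack = []

    def run(self, steps):
        # steps: list of (name, action, undo) where action() performs an
        # upload and undo() deletes whatever action() created.
        try:
            for name, action, undo in steps:
                action()
                self.committed.append(name)
                self._undo_stack.append(undo)
        except Exception:
            for undo in reversed(self._undo_stack):
                undo()  # roll back, newest first
            self.committed.clear()
            raise


def fail():
    raise RuntimeError("attachment upload rejected")


uploaded = []
tx = UploadTransaction()
try:
    tx.run([
        ("dataset", lambda: uploaded.append("dataset"),
         lambda: uploaded.remove("dataset")),
        ("files", lambda: uploaded.append("files"),
         lambda: uploaded.remove("files")),
        ("attachments", fail, lambda: None),
    ])
except RuntimeError:
    pass
# After the third step fails, the first two uploads are rolled back.
```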

@@ -65,6 +65,13 @@ def test_create_dataset_model(scicat_client, derived_dataset):
        assert expected == dict(downloaded)[key], f"key = {key}"


def test_validate_dataset_model(real_client, require_scicat_backend, derived_dataset):
    real_client.scicat.validate_dataset_model(derived_dataset)
    derived_dataset.type = "banana"
Collaborator:

Please use a different field. E.g., set the contactEmail to something that is not an email address.
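The suggested test shape might look like the sketch below. The fake client is a stand-in for the real_client fixture so the example is self-contained; the point of contactEmail is that the bad value is still a well-typed string, so rejection comes from server-side validation rather than from a wrong type:

```python
# Sketch of the suggested test (stand-ins only, not the real fixtures).
class FakeScicat:
    def validate_dataset_model(self, dset: dict) -> None:
        # Crude stand-in for SciCat's server-side email check.
        if "@" not in dset.get("contactEmail", ""):
            raise ValueError(f"Dataset {dset} did not pass validation in SciCat.")


def test_validate_dataset_model_rejects_bad_email():
    scicat = FakeScicat()
    dataset = {"contactEmail": "owner@ess.eu"}
    scicat.validate_dataset_model(dataset)  # a well-formed dataset passes
    dataset["contactEmail"] = "not-an-email"  # type-correct but semantically invalid
    try:
        scicat.validate_dataset_model(dataset)
    except ValueError:
        return  # expected: validation rejected the email
    raise AssertionError("validation unexpectedly passed")


test_validate_dataset_model_rejects_bad_email()
```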

Collaborator:

Can you also add a test of upload_new_dataset_now that makes use of a disabled validation function in the fake client?
There should already be a test for a failed validation, but that checks for failure in the __init__ of the model.

            operation="validate_dataset_model",
        )
        if not response["valid"]:
            raise ValueError(f"Dataset {dset} did not pass validation in SciCat.")
Collaborator:

Can you change this to a pydantic.ValidationError to be in line with the local validation? Or is there a reason to use a different type?

@jl-wynen added the sprint-scipp-2023-09 label (Sprint of the Scipp team) on Sep 29, 2023
@jl-wynen merged commit 88b3318 into main on Oct 2, 2023 (12 checks passed)
@jl-wynen deleted the validate-dataset branch on Oct 2, 2023, 07:31
Linked issue: Validate dataset with backend before upload