Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial validate support, with --strict option #142

Closed
wants to merge 28 commits into from

Conversation

will-moore
Copy link
Member

@will-moore will-moore commented Dec 2, 2021

Work in progress...

I am testing the JSON schemas from ngff repo, downloading them to a cache with https://github.com/allenai/cached_path.
This PR adds a validate command that validates the JSON at each Node (currently only working for Multiscales spec).

NB: some validation occurs during the init of the Nodes e.g. axes matching xyzct etc. so if that fails then we don't even get to validate against the schemas.

We use --strict flag to use strict_ schemas to show warnings where strict rules are not met.

cc @glyg

$ ome_zarr validate https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001247.zarr --strict
WARNING:ome_zarr.io:version mismatch: detected:FormatV01, requested:FormatV03
WARNING:ome_zarr.io:version mismatch: detected:FormatV03, requested:FormatV01
WARNING:ome_zarr.io:version mismatch: detected:FormatV01, requested:FormatV03
WARNING:ome_zarr.io:version mismatch: detected:FormatV03, requested:FormatV01
Validating Multiscales spec at https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001247.zarr/ [zgroup]
Using Multiscales schema version 0.1
WARNING 'name' is a required property
WARNING 'type' is a required property
WARNING 'metadata' is a required property
Validating Multiscales spec at https://s3.embassy.ebi.ac.uk/idr/zarr/v0.1/6001247.zarr/labels/masks/ [zgroup]
Using Multiscales schema version 0.1
WARNING 'name' is a required property
WARNING 'type' is a required property
WARNING 'metadata' is a required property

@codecov
Copy link

codecov bot commented Dec 2, 2021

Codecov Report

Merging #142 (c03b0a0) into master (d729a38) will increase coverage by 0.49%.
The diff coverage is 95.94%.

@@            Coverage Diff             @@
##           master     #142      +/-   ##
==========================================
+ Coverage   83.90%   84.39%   +0.49%     
==========================================
  Files          12       12              
  Lines        1379     1442      +63     
==========================================
+ Hits         1157     1217      +60     
- Misses        222      225       +3     
Impacted Files Coverage Δ
ome_zarr/data.py 87.00% <ø> (ø)
ome_zarr/utils.py 90.19% <88.88%> (-0.61%) ⬇️
ome_zarr/reader.py 86.15% <97.77%> (+1.15%) ⬆️
ome_zarr/cli.py 89.77% <100.00%> (+1.16%) ⬆️
ome_zarr/format.py 97.59% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d729a38...c03b0a0. Read the comment docs.

@glyg
Copy link
Contributor

glyg commented Dec 2, 2021

Hi @will-moore thanks for the ping. I'm waiting a bit to advance on the rest of the validation, mainly because I'm not sure of what will be left to validate once you're done :)
But maybe I'm wrong and the ticked list here is all there is?

If so I'll try to get that started soon

@sbesson
Copy link
Member

sbesson commented Dec 2, 2021

Trying to capture knowledge from our historical experience with XML and XSD schemas, my understanding is that the semantics of JSON schemas is very similar on this particular issue and only distinguish between required and optional attributes. There is no real recommended concept and one reason for this is that recommended is loosely defined from a validation perspective. A few examples of open questions:

  • assuming a level exists between required and optional, is this a flat level or would some of the attributes more recommended than others?
  • is the recommended state context dependent i.e. is it the responsibility of the tool/application to make decisions on recommended attributes and report accordingly?

Brainstorming a few ideas:

  • from former discussions around minimal requirements, one idea would be to introduce different level of validations e.g. level 0 would be the absolute minimal specification (similar to https://docs.openmicroscopy.org/ome-model/latest/specifications/minimum.html), level 1 would require some attributes marked as optional in the base schema, level 2 would require more attributes...
  • is there some infrastructure that would allow someone to inject additional validation rules? The JSON schema would provide the base validation framework i.e. ensuring the syntax is valid and that the required properties are specified but e.g. some profile would allow to specify additional constraints?

@will-moore
Copy link
Member Author

@glyg I wouldn't claim that the list there is complete - It's just what I picked up from the spec, looking for MUST rules. But not every detail of the spec is spelt out (it's possible to imagine a file that passed all those rules and was still invalid for some other reason).
However, it's also likely that some of those unchecked items are already covered by ome-zarr-py. E.g. some axes validation at

if len(set(axes) - axes_values) > 0:
(when you try to parse any zarr image) and
def _validate_axes_names(
(currently only used when writing an OME-Zarr).

@sbesson The method I've used here (adding more restrictions to the schema via merging python dicts) seems to work so far and could be extended for various levels. It's still limited by the JSON schema validation, so the logic can't get too complex, but it seems worth exploring in more depth.

@will-moore
Copy link
Member Author

@sbesson I've updated the approach to match the allOf method of combining schemas adopted at ome/ngff#76

Re: required, optional and recommended: The spec uses MUST (required), SHOULD (recommended) and MAY (optional) rules. I don't think this is tool/application specific.

Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is a substantial step forward in the implementation of a first validation API consuming the newly introduced JSON schemas. The strategy of using an allOf schema allows to validate recommended attributes in addition to the mandatory ones (covered by the default JSON schema). Interested to hear alternate implementation suggestons but from my side, this feels like a solid approach to encode and validate the standard MUST/SHOULD/MAY semantics.

Several suggestions/comments made inline on the implementation/packaging/API. The .DS_Store files also need to be removed (and added to .gitignore)

Probably beyond the scope of this PR, capturing a few thoughts about next steps:

  • should minimal validation be opt-in by default (leaving strict validation as an option)? Assuming the main drawback to doing so would be a performance impact, I suspect this will need to be profiled e.g. using complex datasets (HCS) stored remotely and look into the time spent validating vs performing other logic.
  • should validation also be added to the writing code?

ome_zarr/cli.py Outdated
# validate
parser_validate = subparsers.add_parser("validate")
parser_validate.add_argument("path")
parser_validate.add_argument("--warnings", action="store_true")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the role of this argument to trigger the validation of the recommended attributes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. We have various terminology: There are Spec rules that SHOULD be followed (so it's recommended that you do). These are covered by a strict schema and if you don't then this gives you warnings (if you use that flag)!
It would be nice to settle on ONE term that could cover all these cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Trying to look into the terminology used by others, I came back to SHACL which we have not decided to adopt at this stage but we might want to come back to in the near future.

A relevant concept is the idea of severity defined here and used by the validation report. In particular, looking at the pySHACL library, the --allow-warnings option effectively controls the severity level beyond which the result will be considered as invalid.

ome_zarr/reader.py Outdated Show resolved Hide resolved
ome_zarr/reader.py Outdated Show resolved Hide resolved
ome_zarr/reader.py Outdated Show resolved Hide resolved
ome_zarr/reader.py Outdated Show resolved Hide resolved
ome_zarr/schemas.py Outdated Show resolved Hide resolved
ome_zarr/schemas.py Outdated Show resolved Hide resolved

if hasattr(node, "validate"):
node.validate(warnings)
yield node
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the logic is duplicate from def info i.e. the graph is traversal with an action being perform at each level node.print vs node.validate.

This raises the question of whether the two commands should be combined e.g. to satisfy the use cases where someone will want to inspect an OME-Zarr, validate it and output a minimal report if the validation passes? There is likely an argument to doing both actions independently (validation or display information) but could that be equally achieved via flags (--validate/--no-validate and --report/--no-report)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if you just do this: does it report by default (and validate)?

$ ome_zarr info image.zarr

Then you can disable both with: ?

$ ome_zarr info image.zarr --no-validate --no-report

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just my 2cents here: I think the cleanest design would be to implement def visit(path: str, func: callable) that evaluates func for each node. This could then be used to implement both info and validate without much code duplication, i.e. both info and validate would just use visit with an appropriate function.

This has the added advantage that it allows further customization by users. If you go for this functionality I would also adopt the "usual" syntax for visit in python: If func returns something other than None, visit terminates and returns it.

ome_zarr/reader.py Outdated Show resolved Hide resolved
@will-moore will-moore mentioned this pull request Dec 14, 2021
@will-moore
Copy link
Member Author

This will fail checks until ome/ngff#77 is merged and released on pypi so that ngff is pip installable (now a requirement of this PR).

Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two inline comments regarding the prefixing of the package/module.

I am generally in favor of being able to consume schemas from a package rather than copying/pasting and this approach will become increasingly needed as the number of consumer repositories increases and this will help with propagating updates.

Probably my biggest concern is related to the usage of github.com/ome/ngff as a single repository for the specification + the schemas + the samples. Assuming we start covering more languages, we will run into similar issues that we are already facing with github.com/ome/ome-model. That being said, I have not found a good working example of an alternate structure. Happy to hear other feedback including from the community but immediately I see gaining experience with the proposal of ome/ngff#77 while being aware that things might be split in the future.

Although we are still exploring and not 100% clear on the best approach, I assume this strategy can then be easily adapated for other languages (Java/JAR, JS/npm) if we start increasing our language coverage

@@ -8,7 +8,11 @@
import dask.array as da
import numpy as np
from dask import delayed
from jsonschema import validate as jsonschema_validate
from jsonschema.validators import validator_for
from ngff.schemas import LocalRefResolver, get_schema
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned earlier today, if we go for an upstream packages containing the schemas as a resource, my vote goes towards prefixing the package/module with ome i.e. ome_ngff.schemas

setup.py Outdated
@@ -23,6 +23,8 @@ def read(fname):
install_requires += (["requests"],)
install_requires += (["scikit-image"],)
install_requires += (["toolz"],)
install_requires += (["jsonschema"],)
install_requires += (["ngff"],)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, I think our project should be prefixed with ome i.e. ome-ngff inline with the differentiation between NGFFs and OME-NGFF that we settled on in the paper

Copy link
Contributor

@constantinpape constantinpape left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be really valuable to have this in, @will-moore. I guess this is pending ome/ngff#77. Anything else blocking here?


if hasattr(node, "validate"):
node.validate(warnings)
yield node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just my 2cents here: I think the cleanest design would be to implement def visit(path: str, func: callable) that evaluates func for each node. This could then be used to implement both info and validate without much code duplication, i.e. both info and validate would just use visit with an appropriate function.

This has the added advantage that it allows further customization by users. If you go for this functionality I would also adopt the "usual" syntax for visit in python: If func returns something other than None, visit terminates and returns it.

@will-moore
Copy link
Member Author

Added visit() as suggested by @constantinpape in c37e116

@sbesson sbesson mentioned this pull request Apr 18, 2022
@joshmoore
Copy link
Member

@satra mentioned that it would be useful to add validation of Zarrs into https://github.com/dandi/dandi-cli (After I pointed out this 😄)

Thoughts on getting this into a release?

localResolver = LocalRefResolver.from_schema(strict_schema)
validator = cls(strict_schema, resolver=localResolver)
for error in validator.iter_errors(json_data):
LOGGER.warn(error.message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, not familiar with the code/design pattern here, can't see if there is some MVC like pattern where the warnings&errors (the Model) are collected/returned by some validate (the Controller), and then displayed (by the Viewer).
For me, as a 3rd party user intending to use this library to validate it, would be useful to have clear separation of the Model/Controller from the Viewer which would be "mine".

Here dandi/dandi-cli#943 (comment) we are to "converge" on some data structure which we would use to cover validation reports across multiple validators possibly used for any specific file/dataset at hands. So it would be important for us to be able to collect errors/warnings/hints uniformly first and then report them via our own "viewer".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your feedback. This work is at quite an early stage, so we haven't considered the usage of this tool outside of a cli validation. In fact, with changes to the schemas at https://github.com/ome/ngff/tree/main/0.4/schemas (include regular and "strict" schemas), there should be less logic required to validate.
You may find that you can choose the JSON schema validation tool of your choice, and don't need anything from ome-zarr-py.
In any case, this PR needs to be reviewed...
cc @sbesson

@will-moore
Copy link
Member Author

@sbesson It looks like https://ngff.openmicroscopy.org/0.4/schemas/plate.schema is 404, but this looks like the correct URL? Using https://raw.githubusercontent.com/ome/ngff/main/0.4/schemas/plate.schema for now...

@will-moore
Copy link
Member Author

With that last commit, renamed --warnings to --strict and added support for validating Wells and Plates.

$ ome_zarr validate https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0056B/7361.zarr --strict
Downloading https://raw.githubusercontent.com/ome/ngff/main/0… ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 4.2/4.2 kB
Downloading https://raw.githubusercontent.com/ome/ngff/main/0… ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 568/568 bytes
WARNING:ome_zarr.reader:'name' is a required property
WARNING:ome_zarr.reader:'maximumfieldcount' is a required property

Well:

$ ome_zarr validate https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0056B/7361.zarr/A/1/ --strict
Downloading https://raw.githubusercontent.com/ome/ngff/main/0… ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 1.2/1.2 kB
Downloading https://raw.githubusercontent.com/ome/ngff/main/0… ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 309/309 bytes
WARNING:ome_zarr.reader:'version' is a required property

Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the PR and pushing this important functionality forward.

A few questions:

  • the usage of cached_path looks like an interesting alternative in the absence of a clear path forward for packaging/distributing thef JSON schemas. What is the library behavior when a live JSON schema is amended? Is the local cached version invalidated and the new version refetched?
  • looking at the get_schema implementations for each specification, it only needs two elements: 1- the version of the specification, 2- the mapping between Spec and schema name. Could this be simplified by populating the version as part of the __init__ method since the lookup is already made there and storing the schema name as a class constant?
  • is there a way to extend the unit tests to add some coverage of the new validation infrastructure?

resolver = RefResolver.from_schema(strict_schema, store=schema_store)
validator = Validator(strict_schema, resolver=resolver)
for error in validator.iter_errors(json_data):
LOGGER.warn(error.message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to logging errors at a WARN level, what is the expected contract of self.validate(strict=True) if the JSON does not meet the strict schema? Should an error be thrown?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in #142 (comment) above, I'm thinking of separating the reporting of errors from the validation itself. So, the cli validate command can log or print errors, but the Spec.validate() and utils.validate() would only return errors / messages (and maybe log at INFO level)?

It could get tricky to give the user an accurate picture of what failed: eg. when you validate an Image with Labels, the image schema is used for both, so you can get an output like this and it's hard to know which object was missing "name":

$ ome_zarr validate 6001247.zarr/ --strict --clear_cache
WARNING:ome_zarr.reader:'metadata' is a required property
WARNING:ome_zarr.reader:'type' is a required property
WARNING:ome_zarr.reader:'name' is a required property
WARNING:ome_zarr.reader:'metadata' is a required property
WARNING:ome_zarr.reader:'type' is a required property

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the spirit of jsonschema.validate vs jsonschema.iter_errors, I concur with the idea of having an API that returns an iterable of all exceptions.

Will the contract be the same for the default and the strict schemas? In the current implementation, the base schema raises an exception but the strict schema does not.
Also I assume since the strict schema extends the base schema, it is not necessary to have 2 validation calls?

def get_schema(self, strict: bool = False) -> Optional[Dict]:
multiscales = self.lookup("multiscales", [])
version = multiscales[0].get("version", CurrentFormat().version)
return get_schema("image", version, strict)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside the scope of this PR, this probably points out the need to split image.schema into omero.schema and multiscales.schema

@will-moore
Copy link
Member Author

Thanks: currently, Multiscales spec has

            self.version = multiscales[0].get(
                "version", "0.1"
            )  # should this be matched with Format.version?

whereas elsewhere I've been using CurrentFormat().version as the default instead of "0.1". OK to update that behaviour for Multiscales to use current version as default?

@sbesson
Copy link
Member

sbesson commented Jul 5, 2022

OK to update that behaviour for Multiscales to use current version as default?

At lest from my side, yes. I assume that's the spirit of the comment in

) # should this be matched with Format.version?

I do not see a reason why Multiscales should not be treated differently from other specs. If the consensus of the library is to default to the latest published specification in the absence of version metadata, it makes sense to apply it everywhere.

While on the topic of version retrieval, I also noticed another location

def _get_metadata_version(self, metadata: dict) -> Optional[str]:
"""
Checks the metadata dict for a version
Returns the version of the first object found in the metadata,
checking for 'multiscales', 'plate', 'well' etc
"""
multiscales = metadata.get("multiscales", [])
if multiscales:
dataset = multiscales[0]
return dataset.get("version", None)
for name in ["plate", "well", "image-label"]:
obj = metadata.get(name, None)
if obj:
return obj.get("version", None)
return None
which contract is to retrieve a version and which duplicates a lot of the logic in this PR. Is there a rationale for having this in seprate places or does that point at the need for some form of Spec.get_version() API that can be consumed both by Spec.validate() and by Format.matches?

Need to create a format instance (of the correct version) since this isn't a static methods
@will-moore
Copy link
Member Author

@sbesson I've used format.get_metadata_version() (no-longer private method) in the Spec.init(). I first need to create an instance of the Format class, since this is not a static method. It could be a static method, since we don't currently have a different implementation for any of the Format subclasses, but it's possible that we might want to look-up the version differently for subclasses in the future, so I didn't change it to a static method.

@will-moore
Copy link
Member Author

@sbesson I added a --clear_cache option that removes the cached_path directory and forces a fresh download.

@will-moore will-moore changed the title Initial validate support, with --warnings option Initial validate support, with --strict option Jul 6, 2022
@will-moore will-moore closed this Apr 17, 2023
@yarikoptic
Copy link
Contributor

Is there replacement for this PR or some issue to get announcement whenever validation would become possible?

@will-moore
Copy link
Member Author

There is some validation code (using pydantic) at https://github.com/czbiohub/iohub/blob/main/iohub/ngff_meta.py but I haven't tried it myself yet.
It was mentioned in the discussion at #258.

You can also use the JSON schemas to validate JSON, e.g.

$ pip install check-jsonschema
$ check-jsonschema --schemafile https://raw.githubusercontent.com/ome/ngff/main/0.4/schemas/image.schema 6001240.zarr/.zattrs
ok -- validation done

@yarikoptic
Copy link
Contributor

Thank you for the pointers -- I will check them out! Since this PR was closed without any comment left to describe decision to close, a quick follow up question

  • no built-in validation interface is being planned/aimed to get?

I am asking since IMHO relying on downstream projects to reimplement standard "specification" for validation is "suboptimal", and validation should go beyond validating just .zattrs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants