-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial validate support, with --strict option #142
Conversation
Codecov Report
@@ Coverage Diff @@
## master #142 +/- ##
==========================================
+ Coverage 83.90% 84.39% +0.49%
==========================================
Files 12 12
Lines 1379 1442 +63
==========================================
+ Hits 1157 1217 +60
- Misses 222 225 +3
Continue to review full report at Codecov.
|
Hi @will-moore thanks for the ping. I'm waiting a bit to advance on the rest of the validation, mainly because I'm not sure of what will be left to validate once you're done :) If so I'll try to get that started soon |
Trying to capture knowledge from our historical experience with XML and XSD schemas, my understanding is that the semantics of JSON schemas is very similar on this particular issue and only distinguish between
Brainstorming a few ideas:
|
@glyg I wouldn't claim that the list there is complete - It's just what I picked up from the spec, looking for ome-zarr-py/ome_zarr/reader.py Line 287 in a3f5028
ome-zarr-py/ome_zarr/writer.py Line 17 in a3f5028
@sbesson The method I've used here (adding more restrictions to the schema via merging python dicts) seems to work so far and could be extended for various levels. It's still limited by the JSON schema validation, so the logic can't get too complex, but it seems worth exploring in more depth. |
@sbesson I've updated the approach to match the Re: required, optional and recommended: The spec uses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this is a substantial step forward in the implementation of a first validation API consuming the newly introduced JSON schemas. The strategy of using an allOf
schema allows to validate recommended attributes in addition to the mandatory ones (covered by the default JSON schema). Interested to hear alternate implementation suggestons but from my side, this feels like a solid approach to encode and validate the standard MUST/SHOULD/MAY semantics.
Several suggestions/comments made inline on the implementation/packaging/API. The .DS_Store
files also need to be removed (and added to .gitignore
)
Probably beyond the scope of this PR, capturing a few thoughts about next steps:
- should minimal validation be opt-in by default (leaving strict validation as an option)? Assuming the main drawback to doing so would be a performance impact, I suspect this will need to be profiled e.g. using complex datasets (HCS) stored remotely and look into the time spent validating vs performing other logic.
- should validation also be added to the writing code?
ome_zarr/cli.py
Outdated
# validate | ||
parser_validate = subparsers.add_parser("validate") | ||
parser_validate.add_argument("path") | ||
parser_validate.add_argument("--warnings", action="store_true") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the role of this argument to trigger the validation of the recommended attributes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. We have various terminology: There are Spec rules that SHOULD
be followed (so it's recommended
that you do). These are covered by a strict schema
and if you don't then this gives you warnings
(if you use that flag)!
It would be nice to settle on ONE term that could cover all these cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Trying to look into the terminology used by others, I came back to SHACL which we have not decided to adopt at this stage but we might want to come back to in the near future.
A relevant concept is the idea of severity defined here and used by the validation report. In particular, looking at the pySHACL library, the --allow-warnings
option effectively controls the severity level beyond which the result will be considered as invalid.
ome_zarr/utils.py
Outdated
|
||
if hasattr(node, "validate"): | ||
node.validate(warnings) | ||
yield node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the logic is duplicate from def info
i.e. the graph is traversal with an action being perform at each level node.print
vs node.validate
.
This raises the question of whether the two commands should be combined e.g. to satisfy the use cases where someone will want to inspect an OME-Zarr, validate it and output a minimal report if the validation passes? There is likely an argument to doing both actions independently (validation or display information) but could that be equally achieved via flags (--validate/--no-validate
and --report/--no-report
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if you just do this: does it report
by default (and validate)?
$ ome_zarr info image.zarr
Then you can disable both with: ?
$ ome_zarr info image.zarr --no-validate --no-report
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just my 2cents here: I think the cleanest design would be to implement def visit(path: str, func: callable)
that evaluates func
for each node. This could then be used to implement both info
and validate
without much code duplication, i.e. both info
and validate
would just use visit
with an appropriate function.
This has the added advantage that it allows further customization by users. If you go for this functionality I would also adopt the "usual" syntax for visit in python: If func
returns something other than None
, visit
terminates and returns it.
This will fail checks until ome/ngff#77 is merged and released on pypi so that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two inline comments regarding the prefixing of the package/module.
I am generally in favor of being able to consume schemas from a package rather than copying/pasting and this approach will become increasingly needed as the number of consumer repositories increases and this will help with propagating updates.
Probably my biggest concern is related to the usage of github.com/ome/ngff as a single repository for the specification + the schemas + the samples. Assuming we start covering more languages, we will run into similar issues that we are already facing with github.com/ome/ome-model. That being said, I have not found a good working example of an alternate structure. Happy to hear other feedback including from the community but immediately I see gaining experience with the proposal of ome/ngff#77 while being aware that things might be split in the future.
Although we are still exploring and not 100% clear on the best approach, I assume this strategy can then be easily adapated for other languages (Java/JAR, JS/npm) if we start increasing our language coverage
ome_zarr/reader.py
Outdated
@@ -8,7 +8,11 @@ | |||
import dask.array as da | |||
import numpy as np | |||
from dask import delayed | |||
from jsonschema import validate as jsonschema_validate | |||
from jsonschema.validators import validator_for | |||
from ngff.schemas import LocalRefResolver, get_schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned earlier today, if we go for an upstream packages containing the schemas as a resource, my vote goes towards prefixing the package/module with ome
i.e. ome_ngff.schemas
setup.py
Outdated
@@ -23,6 +23,8 @@ def read(fname): | |||
install_requires += (["requests"],) | |||
install_requires += (["scikit-image"],) | |||
install_requires += (["toolz"],) | |||
install_requires += (["jsonschema"],) | |||
install_requires += (["ngff"],) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, I think our project should be prefixed with ome
i.e. ome-ngff
inline with the differentiation between NGFFs and OME-NGFF that we settled on in the paper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be really valuable to have this in, @will-moore. I guess this is pending ome/ngff#77. Anything else blocking here?
ome_zarr/utils.py
Outdated
|
||
if hasattr(node, "validate"): | ||
node.validate(warnings) | ||
yield node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just my 2cents here: I think the cleanest design would be to implement def visit(path: str, func: callable)
that evaluates func
for each node. This could then be used to implement both info
and validate
without much code duplication, i.e. both info
and validate
would just use visit
with an appropriate function.
This has the added advantage that it allows further customization by users. If you go for this functionality I would also adopt the "usual" syntax for visit in python: If func
returns something other than None
, visit
terminates and returns it.
Added |
@satra mentioned that it would be useful to add validation of Zarrs into https://github.com/dandi/dandi-cli (After I pointed out this 😄) Thoughts on getting this into a release? |
ome_zarr/reader.py
Outdated
localResolver = LocalRefResolver.from_schema(strict_schema) | ||
validator = cls(strict_schema, resolver=localResolver) | ||
for error in validator.iter_errors(json_data): | ||
LOGGER.warn(error.message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, not familiar with the code/design pattern here, can't see if there is some MVC like pattern where the warnings&errors (the Model) are collected/returned by some validate
(the Controller), and then displayed (by the Viewer).
For me, as a 3rd party user intending to use this library to validate it, would be useful to have clear separation of the Model/Controller from the Viewer which would be "mine".
Here dandi/dandi-cli#943 (comment) we are to "converge" on some data structure which we would use to cover validation reports across multiple validators possibly used for any specific file/dataset at hands. So it would be important for us to be able to collect errors/warnings/hints uniformly first and then report them via our own "viewer".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your feedback. This work is at quite an early stage, so we haven't considered the usage of this tool outside of a cli validation. In fact, with changes to the schemas at https://github.com/ome/ngff/tree/main/0.4/schemas (include regular and "strict" schemas), there should be less logic required to validate.
You may find that you can choose the JSON schema validation tool of your choice, and don't need anything from ome-zarr-py
.
In any case, this PR needs to be reviewed...
cc @sbesson
@sbesson It looks like https://ngff.openmicroscopy.org/0.4/schemas/plate.schema is |
With that last commit, renamed
Well:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the PR and pushing this important functionality forward.
A few questions:
- the usage of
cached_path
looks like an interesting alternative in the absence of a clear path forward for packaging/distributing thef JSON schemas. What is the library behavior when a live JSON schema is amended? Is the local cached version invalidated and the new version refetched? - looking at the
get_schema
implementations for each specification, it only needs two elements: 1- the version of the specification, 2- the mapping betweenSpec
and schema name. Could this be simplified by populating the version as part of the__init__
method since the lookup is already made there and storing the schema name as a class constant? - is there a way to extend the unit tests to add some coverage of the new validation infrastructure?
resolver = RefResolver.from_schema(strict_schema, store=schema_store) | ||
validator = Validator(strict_schema, resolver=resolver) | ||
for error in validator.iter_errors(json_data): | ||
LOGGER.warn(error.message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to logging errors at a WARN level, what is the expected contract of self.validate(strict=True)
if the JSON does not meet the strict schema? Should an error be thrown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in #142 (comment) above, I'm thinking of separating the reporting of errors from the validation itself. So, the cli validate
command can log or print errors, but the Spec.validate()
and utils.validate()
would only return errors / messages (and maybe log at INFO level)?
It could get tricky to give the user an accurate picture of what failed: eg. when you validate an Image with Labels, the image
schema is used for both, so you can get an output like this and it's hard to know which object was missing "name":
$ ome_zarr validate 6001247.zarr/ --strict --clear_cache
WARNING:ome_zarr.reader:'metadata' is a required property
WARNING:ome_zarr.reader:'type' is a required property
WARNING:ome_zarr.reader:'name' is a required property
WARNING:ome_zarr.reader:'metadata' is a required property
WARNING:ome_zarr.reader:'type' is a required property
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the spirit of jsonschema.validate
vs jsonschema.iter_errors
, I concur with the idea of having an API that returns an iterable of all exceptions.
Will the contract be the same for the default and the strict schemas? In the current implementation, the base schema raises an exception but the strict schema does not.
Also I assume since the strict schema extends the base schema, it is not necessary to have 2 validation calls?
ome_zarr/reader.py
Outdated
def get_schema(self, strict: bool = False) -> Optional[Dict]: | ||
multiscales = self.lookup("multiscales", []) | ||
version = multiscales[0].get("version", CurrentFormat().version) | ||
return get_schema("image", version, strict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outside the scope of this PR, this probably points out the need to split image.schema
into omero.schema
and multiscales.schema
Thanks: currently,
whereas elsewhere I've been using |
At lest from my side, yes. I assume that's the spirit of the comment in ome-zarr-py/ome_zarr/reader.py Line 284 in 6c24f13
I do not see a reason why Multiscales should not be treated differently from other specs. If the consensus of the library is to default to the latest published specification in the absence of version metadata, it makes sense to apply it everywhere.
While on the topic of version retrieval, I also noticed another location ome-zarr-py/ome_zarr/format.py Lines 67 to 82 in 6c24f13
Spec.get_version() API that can be consumed both by Spec.validate() and by Format.matches ?
|
Need to create a format instance (of the correct version) since this isn't a static methods
@sbesson I've used |
@sbesson I added a |
Is there replacement for this PR or some issue to get announcement whenever validation would become possible? |
There is some validation code (using pydantic) at https://github.com/czbiohub/iohub/blob/main/iohub/ngff_meta.py but I haven't tried it myself yet. You can also use the JSON schemas to validate JSON, e.g.
|
Thank you for the pointers -- I will check them out! Since this PR was closed without any comment left to describe decision to close, a quick follow up question
I am asking since IMHO relying on downstream projects to reimplement standard "specification" for validation is "suboptimal", and validation should go beyond validating just .zattrs. |
Work in progress...
I am testing the JSON schemas from ngff repo, downloading them to a cache with https://github.com/allenai/cached_path.
This PR adds a
validate
command that validates the JSON at each Node (currently only working for Multiscales spec).NB: some validation occurs during the init of the Nodes e.g. axes matching xyzct etc. so if that fails then we don't even get to validate against the schemas.
We use
--strict
flag to usestrict_
schemas to show warnings where strict rules are not met.cc @glyg