Versioning codecs #148

Closed
jakirkham opened this issue Jul 21, 2022 · 11 comments · Fixed by #187
Labels
core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec

Comments

@jakirkham
Member

Currently codecs don't have a version, so if a codec changes in a way that breaks its API, loading existing data would break. Would it make sense to include one? Or would we want to handle this a different way (like naming the codec differently, for example bz2 & bz2_2)?

xref: fsspec/kerchunk#198 (comment) (where this came up originally)

cc @martindurant @joshmoore

@jbms
Contributor

jbms commented Jul 21, 2022

It seems like just using a different identifier for the codec would be simpler, but an individual codec could also use a version field as part of its json representation if that were useful. It might help to consider specific examples.
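To make the two options concrete, here is a minimal sketch of what each might look like in codec metadata. The `version` key, the `level` parameter, and the `codec_key` helper are assumptions made for illustration; nothing in the spec or in any library defines them, and treating a missing version as 1 is also an assumption.

```python
import json

# Option A (hypothetical): a breaking change gets a new codec identifier.
renamed = {"id": "bz2_2", "level": 9}

# Option B (hypothetical): the identifier stays and the codec's own JSON
# representation carries a "version" field.
versioned = {"id": "bz2", "version": 2, "level": 9}


def codec_key(config):
    """Return (id, version); a missing version is treated as 1 (assumption)."""
    return config["id"], config.get("version", 1)


for config in (renamed, versioned):
    print(json.dumps(config), "->", codec_key(config))
```

Either way a reader ends up with a key it can look a decoder up by; the difference is whether the version lives in the name or in the config.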

@jakirkham
Member Author

Yeah that sounds like the naming suggestion above. Agree that's one way to go about it.

Think Martin had a more specific example. So hopefully he can chime in 🙂

@martindurant
Member

I managed to make my change not break the API on this occasion, but I don't anticipate being able to do so every time. I would probably use a version argument rather than a new name when I expect all new data to use the new code and only maintain the old one for existing data. In either case, this leaves it up to the codec authors to make the decision, but it might be a nice thing to mention in our developer docs.

@jakirkham
Member Author

So interesting question, how would the old data get loaded if there was a break? Would the library need to keep around 2 decoders? Would users need to go back to an earlier version of the library? Or should something else be done?

@martindurant
Member

You would need to keep the old code, referenced with the same codec name - unless we invent some other mechanism
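A rough sketch of what "keep the old code under the same codec name" could look like: one identifier, with the decoder chosen by a version field. The registry, the `example` codec id, and the `version` key are all hypothetical, and `zlib`/`bz2` only stand in for an "old" and a "new" implementation; this is not how any existing library dispatches.

```python
import bz2
import zlib

# Hypothetical registry: (codec id, version) -> decode function.
# zlib and bz2 stand in for the old and new implementations of one codec.
DECODERS = {
    ("example", 1): zlib.decompress,  # old code, kept for existing data
    ("example", 2): bz2.decompress,   # new code, used for all new data
}


def decode(config, data):
    """Pick a decoder from the codec id and version; a missing version means 1."""
    key = (config["id"], config.get("version", 1))
    try:
        decoder = DECODERS[key]
    except KeyError:
        raise ValueError(
            f"no decoder for codec {key[0]!r} version {key[1]}"
        ) from None
    return decoder(data)


old_chunk = zlib.compress(b"chunk bytes")
new_chunk = bz2.compress(b"chunk bytes")
old_result = decode({"id": "example"}, old_chunk)
new_result = decode({"id": "example", "version": 2}, new_chunk)
```

With this shape, old data keeps loading without users pinning an earlier library release, at the cost of the library carrying both decoders.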

@jakirkham
Member Author

jakirkham commented Jul 22, 2022

Should version number increments only indicate breaking changes or could they indicate other things? If the latter, when else would we want to use them?

@martindurant
Member

Up to the author, I suppose, but if we have a simple version number like 1, 2, ..., then I suppose breaking changes as in semver. It is conceivable to have more prescriptive codec names in the .zarray like {"id": "gzip~=1.2.3"}, but we usually save that kind of stuff for an environment file. I note that intake catalogs, for example, only give functions and arguments and versions thereof.
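For illustration only, this is roughly how a reader might interpret a prescriptive id such as `gzip~=1.2.3`, using the compatible-release meaning that `~=` has in Python packaging (at least 1.2.3, and still within 1.2.*). The helper names are hypothetical and the parsing is deliberately naive (numeric components only); a real implementation would more likely lean on an existing version-parsing library.

```python
def parse_codec_id(codec_id):
    """Split 'gzip~=1.2.3' into ('gzip', (1, 2, 3)); no specifier gives (name, None)."""
    name, sep, spec = codec_id.partition("~=")
    if not sep:
        return name, None
    return name, tuple(int(part) for part in spec.split("."))


def is_compatible(installed, required):
    """Compatible-release check on version tuples.

    The installed version must share every component of the requirement
    except the last one, and must not be older than the requirement.
    """
    if required is None:
        return True
    prefix = required[:-1]
    return installed[:len(prefix)] == prefix and installed >= required


name, required = parse_codec_id("gzip~=1.2.3")
ok_patch = is_compatible((1, 2, 5), required)   # newer patch release
ok_minor = is_compatible((1, 3, 0), required)   # different minor release
```

A plain integer version that only increments on breaking changes avoids all of this, which is part of the argument for leaving exact library pins to an environment file.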

@jakirkham
Member Author

Limiting to breaking changes makes sense.

Was thinking about reproducibility (IOW if someone wants to use the exact same libraries to read as wrote the data). Though maybe that can be captured in separate optional metadata (#139).

@martindurant
Member

Yeah, I'm not sure how much information it makes sense to include directly in the dataset, as opposed to catalog or other metadata location (e.g., unique run ID for pangeo-forge).

@joshmoore
Member

As someone who won't regularly have a catalog, I'd vote for adding this to the dataset itself. In my mind, it's the schema of the config that we're versioning, no? If a single json-schema (without ORs) can't validate the config, then I assume you'd need to point to a different schema (i.e., a different purl).

@jakirkham
Member Author

It seems we agree that having the codec version is useful.

Though the last few comments start discussing other version info (like info about the writer). Do we want to keep discussing that here or raise a new issue? If the former, maybe we can start enumerating what other versioned info/metadata we would want to include.
