Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support deprecating checkpoint names #45

Merged
merged 4 commits into from
Aug 9, 2022
Merged

Support deprecating checkpoint names #45

merged 4 commits into from
Aug 9, 2022

Conversation

rofinn
Copy link
Member

@rofinn rofinn commented Jul 11, 2022

If you want to rename a checkpoint you can now do.

function __init__()
    register(MODULE, ["foo", "bar", "baz", "qux_a", "qux_b", "deprecated"])
    deprecate(MODULE, "quux", "qux_a")
    deprecate(MODULE, "quuz", "qux_b")
end

which makes the deprecation searchable with deprecated_checkpoints and will handle configuring the correct checkpoint after throwing a warning about the change.

This should be a non-breaking change.

@codecov
Copy link

codecov bot commented Jul 11, 2022

Codecov Report

Merging #45 (b662294) into main (8c25fea) will increase coverage by 0.13%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #45      +/-   ##
==========================================
+ Coverage   96.20%   96.34%   +0.13%     
==========================================
  Files           5        5              
  Lines         158      164       +6     
==========================================
+ Hits          152      158       +6     
  Misses          6        6              
Impacted Files Coverage Δ
src/Checkpoints.jl 94.11% <100.00%> (+1.09%) ⬆️
src/handler.jl 96.87% <0.00%> (-0.10%) ⬇️
src/session.jl 100.00% <0.00%> (ø)

Copy link
Member

@AlexRobson AlexRobson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(i'm not that used to this package so may not be the best reviewer for this but as i know the issue that this relates to i've made a few comments)

Can the deprecations not be encoded into the Handler? With an appropriate constructor that means it's non-breaking. Right now, iiuc, the deprecations will produce a handler that will always encode the 'new' path. When indexing checkpoints, such as with any package that calls checkpoint_path will always point to the new path. This means that any downstream packages that chooses to support the old name will have to do some mungeing to recreate them - which may be more appropriate to exist here?

To clarify a bit what I'd like to understand is what behaviour is supported with deprecations. A deprecated checkpoint will throw a warning that it is deprecated (and then go on to continue to use the new name) or a deprecated checkpoint will throw a warning but also support old paths.

Speaking as a user, it'd be good to support old paths, or at least having the option to opt-in to them. That way how data versioning is handled isn't presupposed here. Hope that makes sense.

@@ -29,7 +29,7 @@ __init__() = Memento.register(LOGGER)

include("handler.jl")

const CHECKPOINTS = Dict{String, Union{Nothing, Handler}}()
const CHECKPOINTS = Dict{String, Union{Nothing, String, Handler}}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this is using types to encode whether the checkpoint is deprecated or not. Thus it uses a String type to denote a deprecated status. Under the hood, it basically encodes a Pair that maps old -> new

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. We could encode this information in as Dict{String, AbstractHandler}() and have dedicated Disabled and Deprecated types. In the end, the string is just acting as a reference.

test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
# a regex / glob syntax
function _config(handler, name::String)
debug(LOGGER, "Checkpoint $name set to use $(handler)")
CHECKPOINTS[name] = handler
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will take in the new name for the checkpoint and add in the handler as normal.

IIUC - this will replace the string encoded. I'm slightly confused here - does this mean that it will overright in CHECKPOINTS that it is deprecated, or do both entries remain - i.e does checkpoints look like (some details abstracted away):

:c1 => Handler(c1), :c2 => Handler(c2), "c3" => "c3_new", :c3_new" =>  Hander(c3_new)

Or:

:c1 => Handler(c1), :c2 => Handler(c2),  :c3 =>  Hander(c3_new)

If the latter, won't deprecated_checkpoints be unable to identify the deprecations after config has run?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CHECKPOINTS[n] isa String condition at the call-site is to ensure the former and not the latter occurs.

@rofinn
Copy link
Member Author

rofinn commented Jul 12, 2022

Can the deprecations not be encoded into the Handler?

No, because deprecations should be the responsibility of package authors. As a result, we need to define deprecations when the author registers the checkpoints. At the point where users are enabling checkpoints with a particular handler we no longer know about the deprecations. The only alternative would be to introduce an AbstractHandler and dedicated Disabled and Deprecated types.

With an appropriate constructor that means it's non-breaking. Right now, iiuc, the deprecations will produce a handler that will always encode the 'new' path.

I don't see how a different constructor would change that. Regardless of how this information is encoded in the package internals we must choose one of the following:

  1. Have the deprecated name redirect to the new name/handler which is what we're doing here.
  2. Have the new name redirect to the old name/handler until the deprecation is removed.
  3. Run the same handler for both names and handle the checkpoint twice, writing duplicate files.

When indexing checkpoints, such as with any package that calls checkpoint_path will always point to the new path. This means that any downstream packages that chooses to support the old name will have to do some mungeing to recreate them - which may be more appropriate to exist here?

Hmm, maybe? That definitely increases the scope of this PR by adding extra functionality that doesn't seem well thought out yet in our internal code. I suppose the simplest way to add that functionality would be with:

  1. Add a deprecated_names field to IndexEntry
  2. Update index_checkpoint_files to grab the correct checkpoint_name and save any deprecated_names
  3. Define a custom filter(name::String, index::Vector{IndexEntry}) function which does the name searching

That doesn't seem like a very clean API and would be more breaking because any code that serializes the IndexEntry type wouldn't be able to deserialize it.

To clarify a bit what I'd like to understand is what behaviour is supported with deprecations. A deprecated checkpoint will throw a warning that it is deprecated (and then go on to continue to use the new name) or a deprecated checkpoint will throw a warning but also support old paths

Since Checkpoints.jl isn't generally responsible for loading checkpoints, though IndexEntry seems to blurs that now, we just redirect to the new behaviour. This seems the most intuitive to me because that's how deprecations in Julia work. Deprecating foo(args...; kwargs...) just redirects to bar(args...; kwargs...). As noted above, we could do the inverse or save to both locations, but that seems less ideal.

Speaking as a user, it'd be good to support old paths, or at least having the option to opt-in to them. That way how data versioning is handled isn't presupposed here. Hope that makes sense.

I think with the current codebase and feature set there isn't any easy way to support old paths without adding unnecessary cruft. As you mentioned earlier, we could integrate the deprecations into the IndexType to make searching for checkpoints less manual, but that doesn't change where the checkpoint is saved. I think giving users the option to save to multiple possible filenames is asking for bugs.

@rofinn
Copy link
Member Author

rofinn commented Jul 12, 2022

I've added a commit to include the deprecated names in our index, though it does feel a bit ugly. Loading the deprecation map correctly depends on having the correct package / version that generated the checkpoint loaded. Seems unlikely that we'd want to load checkpoints for packages without having the package loaded, but it's worth noting.

@AlexRobson
Copy link
Member

AlexRobson commented Jul 13, 2022

Since Checkpoints.jl isn't generally responsible for loading checkpoints, though IndexEntry seems to blurs that now, we just redirect to the new behaviour

OK interesting. Responding to this first, as this is actually at the heart of my previous comments, as I was assuming that this was the case - (not necessary loading, but identifying the path). Deprecating a checkpoint and saving to a new name is entirely reasonable (and fits the deprecation pattern you point out). However with the way that the path is encoded my understand is that all previous checkpointed data essentially becomes lost to us ("can't find checkpoint foo but we can find foo_new"). Data that is actually quite costly to create. My view was that it was up to the downstream package to choose whether to support loading the deprecated checkpoints (or not). Without this, there isn't much gain in deprecating as it's no longer backwards compatible anyway, with the purpose of deprecation to act as a bridge between old and new.

EDIT: Peering a bit closer, I now see that the handler and pathing look entirely separate. The indexing simply comes from selecting the jlso files within a path. This is a misunderstanding I had, and was why I was referring to updating the handler previously.l

@AlexRobson
Copy link
Member

AlexRobson commented Jul 13, 2022

I've added a commit to include the deprecated names in our index, though it does feel a bit ugly. Loading the deprecation map correctly depends on having the correct package / version that generated the checkpoint loaded. Seems unlikely that we'd want to load checkpoints for packages without having the package loaded, but it's worth noting.

OK I think without my above misconception (that the package had more responsibility in loading) I see the merit in the original MR. IIUC this last commit correct, the IndexEntry essentially knows how to distinguish between real and deprecated files though as you point out this is a bit of a workaround. There is no issue before in actually indexing the deprecated files - they will still appear, it's just downstream it will look for 'name' but only finds 'old_name'.

With that in mind, fwiw my opinion is that the last commit should be removed and keep the indexing as it was before, given that responsibility for loading these checkpoints. One intermediate option, that may be worth incorporating an isdeprecated flag into IndexEntry which (i believe) would be something simple (ignoring details) like: occursin(checkpoint_name(x::Path), first.(deprecated_names())) - that way at least downstream knows that there are some deprecations going on. Then downstream needs to decide how it handles the deprecations.

What do you think? Sorry for the two-and-fro - I was originally surprised we would need to work on our downstream package to handle this but on reflection I think that it makes sense what you had originally (both the MR here and into actions)

@rofinn
Copy link
Member Author

rofinn commented Jul 18, 2022

all previous checkpointed data essentially becomes lost to us ("can't find checkpoint foo but we can find foo_new")

That seems like a bit of an exaggeration. There are still plenty of breadcrumbs to find what you're looking for. Since we only do this kind of searching in 1 downstream package right now we just need to inspect the deprecated checkpoints list first. I can only think of two cases where the checkpoint might be "lost":

  1. You don't have the author package loaded in the environment where you want to load the checkpoint.
  2. The environment saving the files is on a newer version of the author package than the environment loading it.

Honestly, if this only ran on a local filesystem I'd say we could just use a symlink. We could update the handler to write a .deprecated file which acts like a symlink regardless of the base filesystem, but that seems ugly and depending on how you're searching it'd still break? I think ultimately based on the recent additions to this package (e.g., search, indexing, arbitrary keys) we basically want a database where each row points to a file.

  1. Search would be easier
  2. Path structures would be codified in a table schema
  3. Deprecations could be handled with duplicate rows pointing to the same file

I'm not really sure what the isdeprecated middle ground gets us. The original proposal had a deprecated_checkpoints function which could be used to grab deprecated checkpoints and adding isdeprecated would still depend on the author package being loaded. I'm inclined to revert the last commit as I've made a separate PR where we can better discuss what search functionality we want.

@AlexRobson
Copy link
Member

all previous checkpointed data essentially becomes lost to us ("can't find checkpoint foo but we can find foo_new")
That seems like a bit of an exaggeration.

OK point made.

Honestly, if this only ran on a local filesystem I'd say we could just use a symlink.

Any applications I personally use checkpoints with all the data is on S3. I'm not entirely sure how they are handled when they are loaded via awss3. Regardless, creating symlinks is a workaround and not really a solution to graceful loading of deprecated checkpoints. Further, as you pointed out, checkpoints isn't strictly responsible for loading.

I'm inclined to revert the last commit as I've made a separate PR where we can better discuss what search functionality we want.

Cool - fwiw yeah I agree.

I'm not really sure what the isdeprecated middle ground gets us. The original proposal had a deprecated_checkpoints function which could be used to grab deprecated checkpoints and adding isdeprecated would still depend on the author package being loaded.

This was really in response to seeing the kind of pattern being added to our downstream package to handle the deprecated name. All this was suggesting was a utility to know if a particular IndexEntry was pointing to a deprecated checkpoint (or not). It would use deprecated_checkpoints. At first glance, such a utility looked like it would make that handling a bit more compact, but if that's a misunderstanding or redundant, no worries. I don't understand the comment about the author package being loaded. Sorry.

Anyway for the purposes of getting this over the line I don't think that there is anything blocking it? Other than reworking the last commit into a separate MR that we seemed to mutually agree on. I'll see if I can approve but up to you want other (more experienced than my) eyes on it!

@rofinn rofinn requested a review from morris25 July 19, 2022 18:25
src/Checkpoints.jl Show resolved Hide resolved
src/indexing.jl Outdated Show resolved Hide resolved
Co-authored-by: Alex Robson <AlexRobson@users.noreply.github.com>
@rofinn rofinn merged commit b45e5ad into main Aug 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants