Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output caching #587

Merged
merged 16 commits into from
Jan 29, 2019
Merged

Output caching #587

merged 16 commits into from
Jan 29, 2019

Conversation

cicdw
Copy link
Member

@cicdw cicdw commented Jan 29, 2019

Please nitpick away and help me make this design more elegant!

Please describe your work and make sure your PR:

  • adds new tests (if appropriate)
  • updates CHANGELOG.md (if appropriate)
  • updates docstrings for any new functions or function arguments, including docs/outline.toml for API reference docs (if appropriate)

What does this PR change?

This PR introduces new machinery for dealing with result handling (closes #575); namely, results are only handled in two situations: when a task requests output caching, or when a retry-like state requires cached_inputs. Moreover, each individual task can specify how its data should be handled (and this data is passed around in the state._metadata attribute).

Why is this PR important?

Closes #576 -- this PR reduces our current overhead of serializing every single result and makes the handling of data more nuanced and customizable.

@cicdw cicdw requested review from jlowin and joshmeek as code owners January 29, 2019 19:59
@cicdw
Copy link
Member Author

cicdw commented Jan 29, 2019

Test failures are caused by mypy, all other tests pass

Copy link
Member

@jlowin jlowin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

I have a few impressionistic thoughts -- one of which is that _metadata feels very "unknowable." It has a very strictly defined schema -- so strict there's the populate_metadata() method (though I think we could get rid of it?) -- but nonetheless is just a nested dict. Should it graduate to a class (or even a DotDict, which would make the API nicer)?

Relatedly, there's a lot of serialization / deserialization of the resulthandlers themselves from inside the metadata. If the metadata had a known schema, then all resulthandlers could be nested schemas within it, and automatically deserialized when the metadata was loaded.

Just some threads to pull on...

from prefect.serialization.result_handlers import ResultHandlerSchema

## if a state has a "cached" attribute or a "cached_inputs" attribute, we need to handle it
if getattr(state, "cached_inputs", None) is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than checking for the attribute, I suggest checking if state.is_pending() and state.cached_inputs is not None:, since only Pending states will qualify

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually that's not entirely correct --> TimedOut states store inputs as well

Copy link
Member

@jlowin jlowin Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIRealized

@cicdw cicdw changed the title [WIP] Output caching Output caching Jan 29, 2019
joshmeek
joshmeek previously approved these changes Jan 29, 2019
jlowin
jlowin previously approved these changes Jan 29, 2019
@cicdw cicdw dismissed stale reviews from jlowin and joshmeek via 01aa3be January 29, 2019 22:19
@cicdw cicdw merged commit 9b096e1 into master Jan 29, 2019
@cicdw cicdw deleted the output-caching branch January 29, 2019 22:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JSONResultHandler for handling small data ResultHandlers should be called in TaskRunner pipeline steps
3 participants