Automatically skip running nodes with persisted outputs #2307
Comments
This is definitely something we should have, although I don't have any concrete ideas on the best way to do it off the top of my head. The broader question of "change capture" has been discussed before but I don't think anything was properly decided on. Maybe now would be the right time to re-open those discussions.
I'd like to add that I'd love this feature. Currently, I have to comment out nodes in my pipelines and add their outputs to the inputs of the pipeline. That's really tedious and seems like an anti-pattern.
(Our team is working on this and plans to open-source it.)
Linking: #2410
xref change capture #221. To me, the main difficulty is that doing this requires making assumptions about the node functions, in particular that they're pure, i.e. that they don't have any spurious inputs such as randomness or the current date. If we assume so, then hashing the inputs is technically sufficient. As I said in #221, this would make
Related Update: our team open-sourced
@sbrugman I was having a look at PyCodeHash, looks superb! One question: what can we do for cases like these?
These would effectively be cached, am I right?
This one is not deterministic. The random component (time) should be a parameter/dataset in order for this to work. (Idempotent pipelines are required.)
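To make the point about determinism concrete, here is a minimal sketch of hash-based skipping in plain Python. It is not Kedro's API and not how PyCodeHash works internally: `fingerprint`, `SkippingRunner` and `add_timestamp` are hypothetical names invented for illustration, and hashing the bytecode is only a crude stand-in for proper code hashing. It assumes inputs are JSON-serialisable and that the node function is pure.

```python
import hashlib
import json


def fingerprint(func, inputs):
    """Hash a function's bytecode and constants together with its inputs.

    Assumption: inputs are JSON-serialisable. Bytecode hashing is a rough
    proxy for "the code did not change", not a robust code hash.
    """
    code = func.__code__
    payload = json.dumps(
        {
            "bytecode": code.co_code.hex(),
            "consts": repr(code.co_consts),
            "inputs": inputs,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SkippingRunner:
    """Hypothetical runner that skips a node when its fingerprint is unchanged."""

    def __init__(self):
        self.store = {}      # output name -> (fingerprint, persisted value)
        self.executed = []   # names of outputs that were actually recomputed

    def run(self, func, inputs, output):
        key = fingerprint(func, inputs)
        cached = self.store.get(output)
        if cached is not None and cached[0] == key:
            return cached[1]  # same code, same inputs: reuse persisted output
        value = func(**inputs)
        self.store[output] = (key, value)
        self.executed.append(output)
        return value


def add_timestamp(rows, run_date):
    # The date arrives as an explicit parameter instead of being read from
    # the clock inside the node, so the node stays deterministic.
    stamped = []
    for row in rows:
        stamped.append(dict(row, run_date=run_date))
    return stamped


runner = SkippingRunner()
rows = [{"id": 1}, {"id": 2}]
first = runner.run(add_timestamp, {"rows": rows, "run_date": "2023-02-01"}, "stamped")
second = runner.run(add_timestamp, {"rows": rows, "run_date": "2023-02-01"}, "stamped")
third = runner.run(add_timestamp, {"rows": rows, "run_date": "2023-02-02"}, "stamped")
```

The second call is skipped because neither the code nor the inputs changed. The third call re-runs because the date, now an ordinary input, is different. Had the node read the clock internally, the fingerprint would not change and a stale output would be reused.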
Closed #221 as a duplicate of this one. The former is older and has some extra context.
After showing Kedro to a data scientist, this was the first thing they asked. They were familiar with DVC.
Description
Re-running nodes which have:
is an unnecessary expense. It might be a good idea to have a flag which would automatically skip running these nodes.
It is currently possible to achieve this by specifying nodes to run from, though this process is manual and potentially error-prone.
Context
User @pedro-sarpen opened #2005 to address this issue, though there may be a better solution to the problem that we should investigate.