-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Toposort fails when using transcoded datasets #3799
Comments
^ the above error is flaky, I don't know why it shows up sometimes but not always. Maybe ignore it for now. I have a new theory that this bug is actually not new, at least the I suspect we did not strip the transcoding consistency and causing some funny thing here.
|
A short test to confirm @noklam outcome that the problem is not in topological sort. So the only difference is that
The actual problem lies in the way we get node_dependencies and _nodes_by_input. When getting
We always
So in the following example, we treat the output of
|
Just rolling back to the old
@noklam, @astrojuanlu, @merelcht, @idanov, the solution here would be to differentiate inputs/output like |
I agree, though I think it generates a false dependency but still resolves in the correct execution order (it would be much worse if the order is wrong). @ElenaKhaustova Do you understand what is transcoding dataset already? https://docs.kedro.org/en/0.17.5/05_data/01_data_catalog.html#transcoding-datasets may help. Without transcoding the resulted graph will be a disconnected graph, so the striping allow connecting the nodes properly. It makes more sense if you look at the @spark @pandas example. In other words, if the nodes are disconnected, the execution order can be wrong because Kedro doesn't understand there is a dependency. Maybe the bug is caused by stripping in the wrong place or we did not strip it consistently? |
Another problem related to the current implementation of transcoding datasets is that we can get circular dependencies when we do not have them.
In this case, If rolling back to 0fc8089, we get the following error, which raises before topological sort is applied:
|
Summary:
@lrcouto The question is if we want to make a release before solving item 2. |
I personally don't think we should release before solving this issue. We could do what @merelcht suggested yesterday and change the node names on the starters just to make it pass the CI, but that would be a band-aid fix. |
We agreed to timebox this to avoid the risk of rushing an improper solution. Targeting a release ~early next week. |
The issue is indeed that we have uncovered something that shouldn't be allowed in the first place, and namely using transcoding as a cheat for adding circular dependencies in nodes:
Transcoding should always be stripped when resolving ordering issues and validating pipelines and nodes for cycles. So the correct action here would be to fix the template. This should raise an error upon node or pipeline creation, depends how we want to define the behaviour, either will be valid. Transcoding was only made to enable hand-offs between pandas and pyspark nodes, or similar use cases. The starter here has gotten it the wrong way around - it's not the format at rest, but the format in-flight that we need to specify here. As far as I can see, there's no bug in Kedro now, there was before due to a peculiarity of In order to make the error reporting more user-friendly, we might decide to add an extra check in the nodes creation, but that doesn't have to happen before the release. Well done to @noklam for doing all the research and tracing it back to |
Applied changes:
Further updates required:
|
Description
The new toposort seems to not handle it well when transcoded datasets are involved.
Context
Appeared in kedro-org/kedro-starters#215
Possible change that caused it: #3728
Steps to Reproduce
When you have a node like
node_dependencies are resolved as:
If we don't call
_strip_transcoding
it resolves it as:{Node(load_shuttles_to_csv, 'test_data@excel', 'test_data@csv', 'load_shuttles_to_csv_node'): set()}
The problem disappears if giving the datasets a different name before the
@
so instead oftest_data@csv
andtest_data@excel
make it:test_csv_data@csv
andtest_excel_data@excel
Expected Result
The error does not appear when using transcoded datasets.
The text was updated successfully, but these errors were encountered: