
Improve kedro run as a package #1423

Closed · wants to merge 7 commits

Conversation

@antonymilne (Contributor) commented Apr 7, 2022

⚠️ This is a huge description because I spent ages figuring this out and I thought we should have it all spelt out somewhere. Haven't done tests or docs yet - just opening this to get initial feedback.

The actual change to code is very small, but a big improvement I think: you can now do

from spaceflights.__main__ import main
outputs = main(pipeline="ds")
print(outputs)

and it will Just Work, even in IPython/Jupyter. All the command line ways of invoking main work exactly as before, i.e. python -m spaceflights --pipeline=ds etc. We now have consistency between main and CLI ways of launching kedro.

Description

A user can run their kedro project in several ways. Note that the run command executed can be defined on the framework side or overridden in turn by a plugin or a project cli.py (done by _find_run_command).

  1. Kedro CLI: kedro run. This is the only route that doesn't go through the project __main__.py; instead it goes through kedro.framework.cli.main, which builds the CLI tree and does something like _find_run_command
  2. CLI but not kedro CLI: python -m spaceflights. This hits the project __main__.py and will call main
  3. CLI but not kedro CLI: python src/spaceflights. This is just a more unusual way of doing 2
  4. Inside your own python script: from spaceflights.__main__ import main; main(), then run the script using python
  5. Inside IPython/Jupyter with no kedro ipython extension: same as 4 but you run it in the notebook
  6. Inside IPython/Jupyter with kedro ipython extension: same as 5 but you can also do session.run (which is what we have advertised as the way to do a kedro run in the past)

All the above must be run from within the project root or kedro won't be able to find your conf. Options 2 onwards need you to have pip-installed your project or to have src on your PYTHONPATH. Note that having the project pip-installed could mean first doing kedro package and then pip install on the resulting .whl file, or it could mean just pip install ./src from your project root; it doesn't make a difference.

Current problems

  1. When using 1, 2, 3 and 4 you specify arguments using the CLI syntax --pipeline ds. With main() you do main(["--pipeline", "ds"]), i.e. the CLI syntax shoehorned into a Python function call. When using session.run you do a pure Python function call as session.run(pipeline_name="ds") - but note the argument name is different!!
  2. Methods 5 and 6 run ok but generate a horrible big mysterious IPython exception after completing
  3. Method 4 isn't actually useful if you want to run a kedro project with anything else since (big problem) no code after main() will actually execute...
  4. ... and (smaller problem) main() doesn't return anything like session.run does, so you couldn't use any kedro outputs downstream anyway
  5. Method 6 using main doesn't work unless you call session.close first, because you'll get an error about there already being an active session (same issue as %run_viz line magic requires session.close first kedro-viz#811)
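The argument-syntax mismatch in problem 1 can be sketched with a stdlib-only stand-in (this is not kedro's actual parser; session_run below is a hypothetical stand-in for session.run):

```python
import argparse

# CLI syntax, as used by `kedro run --pipeline ds` and main(["--pipeline", "ds"]):
parser = argparse.ArgumentParser()
parser.add_argument("--pipeline")
cli_args = parser.parse_args(["--pipeline", "ds"])

# session.run syntax is a plain keyword call, with a *different* argument name:
def session_run(pipeline_name=None):  # hypothetical stand-in for session.run
    return pipeline_name

# Both spellings carry the same information, just in different shapes/names
assert cli_args.pipeline == session_run(pipeline_name="ds") == "ds"
```

The point is simply that the same option is spelt three different ways (--pipeline, ["--pipeline", "ds"], pipeline_name="ds") depending on the entry point.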

This PR fixes 1, 2, 3, 4. We should have separate ticket(s) to fix 5 and also the following:

  • move the horrible code out of the project __main__.py to somewhere framework side, because no one wants to see that in their project... Possible locations might be kedro.framework.cli.utils or kedro.framework.project
  • figure out if we can get a user to do a nicer looking import than from spaceflights.__main__ import main, which doesn't really look like it should be done
  • fix a bug where commands defined in the project cli.py aren't given their full path when you run kedro, but are instead just listed under Project specific commands from cli. I think this is a bug anyway and it didn't used to be like this...
  • consider changing the behaviour when the project cli.py exists but is empty/all commented out. I found the current behaviour a bit annoying when playing around with this
  • consider whether we should recommend main over session.run in IPython or even stop exposing session altogether. Mooted briefly in Technical design decision record for KedroSession  #1335 but needs more discussion. main is a higher-level function than session.run and respects the hierarchy of framework < plugins < project cli.py. The arguments to main also match the usual kedro run arguments, which those in session.run do not

Development notes

There's basically two things that have changed:

  • add return to the kedro run command. session.run, and hence main, returns the free outputs of the pipeline after the kedro run. This fixes problem 4
  • change how main calls the run command. This fixes problems 1, 2, 3

Adventures with click

I opened an issue on the click repo (which didn't go down well... 😅) describing the original source of problems 1, 2, 3. In short, when you run a click command it always exits afterwards by raising SystemExit, even if execution was successful. This means it's impossible to run any code after main() in a Python script, and it also massively confuses IPython, which blows up.

For the record, some of the things that don't quite fix this are:

  • run(standalone_mode=False), as suggested on the click repo. This solves problems 2 and 3 but not 1
  • run.callback. This nearly gets you there but doesn't fill in the default arguments for you so you need to explicitly give all the arguments in the function call, which is horrible
  • run.main is equivalent to what we were doing before (run.__call__) so doesn't help
  • run.invoke is really close to doing it, but unlike run.forward it doesn't fill in default arguments, and it won't notice CLI arguments when called with python -m spaceflights --pipeline=ds

Overall, I think the code here is the unique combination that does exactly what we want, both on the CLI and in scripts/IPython 🎉
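The exit-on-success behaviour described above can be illustrated with a stdlib-only stand-in (no click involved; both functions below are hypothetical, just to show the shape of the problem and the fix):

```python
import sys

def cli_style_main(args):
    """Mimics click's standalone behaviour: even a successful run exits."""
    outputs = {"pipeline": args[0]}
    sys.exit(0)  # raises SystemExit, so nothing after the call runs normally

def script_friendly_main(args):
    """What this PR moves towards: return the outputs instead of exiting."""
    return {"pipeline": args[0]}

try:
    cli_style_main(["ds"])
    print("never reached")  # SystemExit means this line never executes
except SystemExit:
    pass  # this uncaught exit is the noisy traceback IPython shows

outputs = script_friendly_main(["ds"])
print(outputs)  # {'pipeline': 'ds'} - usable by downstream code
```

This is why simply returning from main, rather than letting click exit, fixes both the "no code after main() runs" problem and the IPython traceback.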

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Antony Milne <antony.milne@quantumblack.com>
@antonymilne (Contributor, Author) commented:

P.S. technically this is a breaking change since the way you call main with arguments has changed. But given that the main(["--pipeline", "ds"]) syntax is so new and not well-known, I think we should just go for this in 0.18.1. It would be really nice to have this working for Databricks, etc.

@antonymilne antonymilne requested a review from noklam April 7, 2022 14:39
@noklam (Contributor) commented Apr 8, 2022

This is a nice summary, and it also reminds me that I have had some challenges making a kedro pipeline work with other Python code. I think this would help integrate kedro with any downstream tasks.

#795. I am not sure if the discussion still holds or things can be done more easily now.

I think this doc is also useful to lay out how people can run a kedro pipeline.

Method 4 isn't actually useful if you want to run a kedro project with anything else since (big problem) no code after main() will actually execute...

Why is this happening?

from spaceflights.__main__ import main

It did look a bit weird, but I think it is because we want to execute python -m spaceflights, thus we need to put it in a __main__.py file? Although I haven't seen any project actually use this before.
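For illustration, a minimal __main__.py of the kind being discussed might look like this (a hypothetical sketch, not the actual templated file; the dict return stands in for real run outputs):

```python
# Hypothetical sketch of src/spaceflights/__main__.py: defining main() here
# is what makes both `python -m spaceflights` and
# `from spaceflights.__main__ import main` point at the same entry point.
import sys

def main(args=None):
    # Fall back to the real command line when no args are passed explicitly,
    # so `python -m spaceflights --pipeline=ds` and main(["--pipeline", "ds"])
    # take the same code path.
    args = sys.argv[1:] if args is None else list(args)
    # ... a real project would find and invoke the run command here ...
    return {"args": args}

if __name__ == "__main__":
    main()
```

Python runs `<package>/__main__.py` whenever the package is executed with python -m, which is why the entry point has to live in that specifically named file.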

consider changing the behaviour when the project cli.py exists but is empty/all commented out. I found the current behaviour a bit annoying when playing around with this

I am not super familiar with click and how cli.py and plugins get picked up by kedro, so I can't comment on this now.

consider whether we should recommend main over session.run in IPython or even stop exposing session altogether. Mooted briefly in #1335 but needs more discussion. main is a higher-level function than session.run and respects the hierarchy of framework < plugins < project cli.py. The arguments to main also match the usual kedro run arguments, which those in session.run do not

I do like the idea of a higher-level entry point,

Misc:
This is not specific to kedro run, but I notice that when I do kedro run I get a dependencies error (pip's error, buried in the log where it's not at all obvious), complaining that it doesn't match the plugins, even though I am not running the plugin at all.

@idanov idanov added this to the 0.18+ milestone Apr 20, 2022
@merelcht merelcht removed this from the 0.18+ milestone Apr 25, 2022
@noklam (Contributor) commented Jun 9, 2022

Just an idea that popped up: do we only expect kedro run as a self-contained package? Is it possible for a kedro pipeline to behave just like a normal function, taking arguments instead of only the catalogue?

@noklam (Contributor) commented Aug 26, 2022

@AntonyMilneQB

In #1807 this would work for a packaged kedro project. What's the benefit of calling the CLI instead of just using KedroSession, which is the preferred way of calling a kedro pipeline programmatically?

One thing that I can think of is when you have a custom cli.py, but I don't think that's something we promote and isn't a common use case.

from kedro.framework.session import KedroSession
from kedro.framework.project import configure_project
package_name = <your_package_name>
configure_project(package_name)

with KedroSession.create(package_name) as session:
    session.run()

@antonymilne (Contributor, Author) commented Sep 8, 2022

@noklam sorry for the incredibly slow response - finally got around to this issue on my big list of things to respond to... 😅

There's a few problems with doing session.run like that:

  • python -m spaceflights should be possible if you just want to run a packaged pipeline by itself. You should be able to trigger (part of) a pipeline by itself from command line without needing to write a Python script to do it
  • in the context of a Python script, your code works fine, but doing configure_project and a context manager for a KedroSession feels overly complex, fiddly and too low-level. There should be a more user-friendly way that wraps all that into a single function call. It should be more obvious how to run a kedro pipeline programmatically. The code you wrote is fine for our "under the hood" implementation, but I feel like there should be a more accessible, higher-level function to do it
  • as you say, it doesn't respect the cli.py run command > plugin run command > kedro run command hierarchy. In the future I hope this will become increasingly irrelevant. As you've noted before, the cli.py route is not well used (more of a historical leftover I think) so maybe it should be removed. The use of a plugin to override run will hopefully become less common if we do something like Universal Kedro deployment (Part 3) - Add the ability to extend and distribute the project running logic #1041 also. But for now (0.18.x), I think maintaining this hierarchy is worth doing

@noklam (Contributor) commented Sep 8, 2022

Thanks @AntonyMilneQB, please see my comments.

def run(
    tag,
    env,
    runner,
    is_async,
    node_names,
    to_nodes,
    from_nodes,
    from_inputs,
    to_outputs,
    load_version,
    pipeline,
    config,
    params,
):
    """Run the pipeline."""
    runner = load_obj(runner or "SequentialRunner", "kedro.runner")
    tag = _get_values_as_tuple(tag) if tag else tag
    node_names = _get_values_as_tuple(node_names) if node_names else node_names
    with KedroSession.create(env=env, extra_params=params) as session:
        session.run(
            tags=tag,
            runner=runner(is_async=is_async),
            node_names=node_names,
            from_nodes=from_nodes,
            to_nodes=to_nodes,
            from_inputs=from_inputs,
            to_outputs=to_outputs,
            load_versions=load_version,
            pipeline_name=pipeline,
        )

  • python -m spaceflights should be possible if you just want to run a packaged pipeline by itself. You should be able to trigger (part of) a pipeline by itself from the command line without needing to write a Python script to do it

I understand this need, but I also doubt whether people do this for a real project; executing a module entirely with python -m isn't a common thing that I do. With a kedro project it's likely you have more than one pipeline and will need different parameters, so it feels quite limited. If package_name is a useful argument in KedroSession (see point 2), then it can just be grouped into kedro run --package_name=xxx with all the other arguments, which is much more consistent and powerful. It's not a good reason to keep yet one more CLI entrypoint when kedro run can easily do the same thing.

  • in the context of a Python script, your code works fine but doing configure_project and a context manager for a KedroSession feels overly complex, fiddly and too low-level. There should be a more user-friendly way that wraps all that into a single function call. It should be more obvious how to run a kedro pipeline programmatically. The code you wrote is fine for our "under the hood" implementation but I feel like there should be more accessible higher-level function to do it

I agree that configure_project is too low-level as a user-facing function. I think the problem here is what KedroSession.create(package_name=xxx) does: it actually does nothing. The only reason it works in a kedro project is that we do this configure_project step behind the scenes, either through the CLI or reload_kedro; what you pass into KedroSession doesn't matter at all.

I am not sure if KedroSession and context managers are too low level though, given this is the standard way to work with the notebook (Databricks) or any Python API. Currently, I view session.run as the closest Python API to kedro run.
If the package_name argument is doing its job, then it should be the same for both project and package mode. This is also quite consistent with the current way: kedro run is just using KedroSession.run, and the only job of the CLI is parsing the arguments.

Why do we need a context manager? Is there a good reason to keep it at all? Note that both extra_params and env are in the context manager because we are saving them to the session_store, but it doesn't add anything at the moment. In the past, it had some responsibility for activating/deactivating sessions, but that was removed a while ago already. In reality, "closing" a session doesn't do anything. So in theory there could be a kedro_run function which basically just does "create a session and run it".
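The kedro_run idea floated above could be sketched roughly like this (the session class is a dummy stand-in, not the real KedroSession; the helper's signature and return value are illustrative assumptions):

```python
# Dummy stand-in for KedroSession, just to show the shape of the helper;
# a real implementation would call KedroSession.create(...) instead.
class _DummySession:
    def __init__(self, env=None, extra_params=None):
        self.env = env
        self.extra_params = extra_params or {}

    def run(self, pipeline_name=None):
        # Stands in for running the pipeline and returning its free outputs
        return {"pipeline": pipeline_name, "env": self.env}

    def close(self):
        pass  # as noted above, closing a session currently does nothing

def kedro_run(pipeline_name=None, env=None, extra_params=None):
    """Create a session, run it, close it: one call, no context manager."""
    session = _DummySession(env=env, extra_params=extra_params)
    try:
        return session.run(pipeline_name=pipeline_name)
    finally:
        session.close()

outputs = kedro_run(pipeline_name="ds", env="local")
```

A single function like this would hide configure_project and the with-block while still returning the run outputs to downstream code.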

  • as you say, it doesn't respect the cli.py run command > plugin run command command > kedro run command hierarchy. In the future, I hope this will become increasingly irrelevant. As you've noted before, the cli.py route is not well used (more of a historical leftover I think) so maybe it should be removed. The use of a plugin to override run will hopefully become less common if we do something like Universal Kedro deployment (Part 3) - Add the ability to extend and distribute the project running logic #1041 also. But for now (0.18.x), I think maintaining this hierarchy is worth doing

I think this is the only compelling reason to me for keeping backward compatibility. But as you said, it's not something we want to keep and this is also a very uncommon usage.

Ideally, there should be only one CLI and one way of doing it in the Python API, whether in Python scripts or IPython. And I think this isn't hard to achieve as we tidy up some of the KedroSession leftovers.

@antonymilne (Contributor, Author) commented:

  • python -m spaceflights should be possible if you just want to run a packaged pipeline by itself. You should be able to trigger (part of) a pipeline by itself from the command line without needing to write a Python script to do it

I understand this need, but I also doubt whether people do this for a real project; executing a module entirely with python -m isn't a common thing that I do. With a kedro project it's likely you have more than one pipeline and will need different parameters, so it feels quite limited. If package_name is a useful argument in KedroSession (see point 2), then it can just be grouped into kedro run --package_name=xxx with all the other arguments, which is much more consistent and powerful. It's not a good reason to keep yet one more CLI entrypoint when kedro run can easily do the same thing.

This might not be done much currently but it is a useful feature and something we should promote more (especially when it's fixed 😀).

  • one way which I didn't put in the original list is that you can just run spaceflights if you've first done kedro package and then pip installed the .whl file. This is essentially the same as doing python -m spaceflights but not quite as safe, because you can sometimes get confused Python environments (in the same way that python -m pip is recommended over pip)
  • perhaps surprisingly, and completely undocumented I think, this method allows arguments to be passed exactly as in kedro run, e.g. spaceflights --pipeline=data_science
  • kedro package ➡️ share .whl file ➡️ pip install .whl file ➡️ run python -m spaceflights is the simplest way to reliably share and deploy a project. Think of my kedro project as a Python package here that I'm trying to distribute. e.g. I have a kedro project I want to send to you so you can run it. I could zip up the code, send it to you, ask you to pip install the requirements and then execute kedro run. But the way to share Python packages is to distribute them (e.g. as a whl file). This way you get all the advantages of distributing something as an actual package, e.g. package versioning, automatic dependency installation when you pip install my spaceflights.whl. Philosophically it's also significant that you don't need to care that it's using kedro underneath. It's a black box to you - just a command you can run like black or git or whatever
  • the above point actually won't work as smoothly as that in reality because of the issue we have with not packaging conf, so I need to send that to you separately also... But you get the idea. In principle python -m spaceflights/just spaceflights is the correct way to execute a packaged project

I agree that configure_project is too low-level as a user-facing function. I think the problem here is what KedroSession.create(package_name=xxx) does: it actually does nothing. The only reason it works in a kedro project is that we do this configure_project step behind the scenes, either through the CLI or reload_kedro; what you pass into KedroSession doesn't matter at all.

I am not sure if KedroSession and context managers are too low level though, given this is the standard way to work with the notebook (Databricks) or any Python API. Currently, I view session.run as the closest Python API to kedro run. If the package_name argument is doing its job, then it should be the same for both project and package mode. This is also quite consistent with the current way: kedro run is just using KedroSession.run, and the only job of the CLI is parsing the arguments.

Why do we need a context manager? Is there a good reason to keep it at all? Note that both extra_params and env are in the context manager because we are saving them to the session_store, but it doesn't add anything at the moment. In the past, it had some responsibility for activating/deactivating sessions, but that was removed a while ago already. In reality, "closing" a session doesn't do anything. So in theory there could be a kedro_run function which basically just does "create a session and run it".

Very fair points all round here, I think. I'm not sure why we have this configure_project step - I'd guess it was introduced when we started doing settings.py and lazy-loading of pipelines etc. If I remember correctly, the original way to handle KedroSession did not have the context manager, and I'm not sure why that was changed. I'd guess that, like you suggest, it's because in the past we used to handle other cases which we don't care about any more (e.g. multiple runs per session, activating/deactivating sessions). Possibly this could be simplified quite a lot now - very nice idea! 👍 Of course it would be quite a breaking change, but it might help clean things up. We'd need to look back and understand what the original motivation for configure_project etc. was, since I'm not sure. Probably @idanov has a better understanding of all this.

run = _find_run_command(package_name)
run(*args, **kwargs)
if kwargs:
    run(**kwargs)
A contributor commented on this diff:

I guess this is unfinished? As I can see in the other __main__.py the result of session.run is returned, but here it doesn't return the output.

@merelcht (Member) commented Sep 28, 2022

We discussed this in Technical Design and talked about the following:

  • The package_name argument provided in KedroSession.create() doesn't do anything, which is confusing to users. This argument should be removed, but since that is a breaking change it will happen in 0.19
  • The workaround of triggering a run within a session is valid, but it's important to note that bootstrap_project is only meant for configuring a project that hasn't been packaged. When working with a packaged project, configure_project should be used instead.
  • The problem with the main() method is that it currently returns an exit code, so downstream processes can't do anything with it.

Some steps to take to improve the situation and explain it better to users:

@noklam (Contributor) commented Sep 28, 2022

The problem with the main() method is that it currently returns an exit code, so downstream processes can't do anything with it.

To clarify this one a bit: I think these are two problems.

  1. main should return the result of session.run() - so downstream processing is possible
  2. click generates a sys.exit() by default, and since main wraps the CLI, this creates issues on Databricks or even just in IPython - this link explains more.

@merelcht (Member) commented:

This task isn't done, and is definitely something we want to work on. @AntonyMilneQB will never forget about this, and will make sure it gets discussed properly 😄 This PR will be closed because it has diverged a lot from the main branch and isn't mergeable.
