Technical design decision record for KedroSession #1335

Closed
merelcht opened this issue Mar 9, 2022 · 7 comments · Fixed by #1329
Labels
Type: Technical DR 💾 Decision Records (technical decisions made)

Comments

@merelcht
Member

merelcht commented Mar 9, 2022

The KedroSession

The KedroSession is the object responsible for managing the lifecycle of a Kedro run. It has two main functions:

  1. Run execution: it instantiates all the core components Kedro needs to execute a run and makes sure the run is executed properly
  2. Persisting run data: KedroSession offers a way to persist run data through the session store. The following data gets saved in the session store:
  • package_name
  • project_path
  • session_id
  • CLI info: command run, run parameters
  • Git info: git sha, git branch, is branch dirty or not

Usage within Kedro 🏗

The KedroSession is a relatively new component within Kedro and, at the time of writing, is mainly used to manage run lifecycles and for experiment tracking. The experiment tracking feature uses a session store implementation called the SQLiteStore, which persists data with SQLite. The other session store implementations available in Kedro are:

  1. BaseSessionStore: the base class for all session stores; it does not persist any data itself
  2. ShelveStore: implementation that uses the shelve package to persist data
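
The difference between a non-persisting base store and a shelve-backed store can be sketched with the standard library alone. This is an illustration, not Kedro's implementation: the class names `InMemoryStore` and `ShelfStore`, the `read`/`save` methods, and the on-disk layout are assumptions chosen for the example.

```python
import dbm
import shelve
import tempfile
from collections import UserDict
from contextlib import suppress
from pathlib import Path


class InMemoryStore(UserDict):
    """Hypothetical base store: behaves like a dict and persists nothing."""

    def __init__(self, path: str, session_id: str):
        self._path = path
        self._session_id = session_id
        # Start from whatever read() finds (nothing, for the base class).
        super().__init__(self.read())

    def read(self) -> dict:
        # Nothing was ever persisted, so there is nothing to load.
        return {}

    def save(self) -> None:
        # Deliberately a no-op: data lives only as long as the object.
        pass


class ShelfStore(InMemoryStore):
    """Hypothetical store that persists its contents with the shelve module."""

    @property
    def _location(self) -> Path:
        # Assumed layout: <path>/<session_id>/store
        return Path(self._path) / self._session_id / "store"

    def read(self) -> dict:
        data = {}
        # A missing shelf simply means there is no earlier data to load.
        with suppress(*dbm.error):
            with shelve.open(str(self._location), flag="r") as shelf:
                data = dict(shelf)
        return data

    def save(self) -> None:
        self._location.parent.mkdir(parents=True, exist_ok=True)
        with shelve.open(str(self._location)) as shelf:
            # Write the current in-memory contents to disk.
            shelf.update(self.data)
```

With these toy classes, data written through `ShelfStore` and saved can be read back by a new `ShelfStore` created with the same path and session id, whereas a new `InMemoryStore` always starts empty.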

Relation of a run and a session 🧑‍🤝‍🧑

While working on #1273 it was decided that a Kedro session and a Kedro run have a one-to-one mapping. This means that once a session is created, only one full pipeline run can ever be kicked off during that session's existence. In practice, Kedro manages this for you under the hood when kedro run is executed.

FAQ ❓

How does a Kedro user use KedroSession?
As a Kedro user you don’t need to access the session directly. When you execute the kedro run command, a new session is created automatically. This session kicks off the pipeline run and, when that process finishes, the session is closed again. On closing, any run data is persisted if the project is configured with a persistent session store.
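
The lifecycle described above (create a session, allow exactly one run, persist run data on close) can be sketched as a small context manager. This is a toy model under stated assumptions, not Kedro's code: `ToySession`, `SessionAlreadyRunError`, and the plain dict standing in for the session store are all hypothetical.

```python
from typing import Any, Callable, Dict


class SessionAlreadyRunError(RuntimeError):
    """Raised when a second run is attempted in one session (hypothetical name)."""


class ToySession:
    """Illustrative stand-in for a session; one session maps to one run."""

    def __init__(self, session_id: str, store: Dict[str, Any]):
        self.session_id = session_id
        self._store = store  # a plain dict plays the role of the session store
        self._run_called = False
        self._record: Dict[str, Any] = {"session_id": session_id}

    def run(self, pipeline: Callable[[], Any]) -> Any:
        # Enforce the one-to-one mapping between a session and a run.
        if self._run_called:
            raise SessionAlreadyRunError(
                "A run was already started in this session; create a new session."
            )
        self._run_called = True
        return pipeline()

    def close(self) -> None:
        # Persist the run data when the session ends.
        self._store[self.session_id] = dict(self._record)

    def __enter__(self) -> "ToySession":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self.close()
```

Used as `with ToySession("s1", store) as session: session.run(my_pipeline)`, the session record is written to `store` when the `with` block exits, and a second call to `session.run` raises `SessionAlreadyRunError`.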

What about using KedroSession in an interactive workflow?
When using Jupyter or IPython you can access the active session object or create a new one. You can then retrieve the session_id and the run data that will be stored, load the context, and execute a run. However, we discourage using the session for anything other than checking the session_id and run data.

Related GitHub issues and PRs:

@merelcht changed the title from "KedroSession technical design decision based on https://github.com/kedro-org/kedro/issues/1273" to "Technical design doc for KedroSession" on Mar 9, 2022
@merelcht merelcht added the Type: Technical DR 💾 Decision Records (technical decisions made) label Mar 9, 2022
@merelcht merelcht linked a pull request Mar 9, 2022 that will close this issue
@datajoely
Contributor

A question that I think myself and others will ask is - if I want to access the data catalog as a live object, do I need to create a session for that? Is that the right way?

@merelcht
Member Author

merelcht commented Mar 9, 2022

> A question that I think myself and others will ask is - if I want to access the data catalog as a live object, do I need to create a session for that? Is that the right way?

The catalog is provided as a variable, just like the session, context and pipelines: https://kedro.readthedocs.io/en/stable/11_tools_integration/02_ipython.html#load-datacatalog-in-ipython

@datajoely
Contributor

@MerelTheisenQB I get that - but users will need to access it in other contexts such as plug-ins and (although not recommended) dynamic contexts. Is there scope to make the catalog importable like the pipelines object is?

@antonymilne
Contributor

antonymilne commented Mar 9, 2022

@datajoely Personally I would like this (unless there are some strong arguments against it that I've forgotten), but I think it's out of scope, for now at least. When we talked about it before, it unfortunately didn't seem as easy to do as it is for pipelines.


Just one comment on kedro session in the interactive workflow: eventually I wonder whether we should stop exposing session in ipython/jupyter at all, i.e. should we remove this line.

My immediate concern is that someone could end up saving sessions to the session store when they are not even calling session.run but just doing some data exploration (although it takes some effort, since you need to call session.close explicitly), which would leave empty runs in experiment tracking. We could already prevent this by passing save_on_close=False here, so that even calling session.close wouldn't save to the session store.

More generally though, I wonder whether there will be any good uses of session in the interactive workflow in the future. Once we're working on this scheme, it seems like a bit of an anti-pattern so maybe not something we should have available at all for users. I mentioned this to @idanov today and he seemed to be in favour of not exposing it. Interested to hear what others (@noklam?) think though, and whether it's important to be able to do session.run (or other session) stuff from a notebook.

@noklam
Contributor

noklam commented Mar 9, 2022

@AntonyMilneQB For me, it's the ability to do checkpoint debugging in an interactive environment that matters. It may be that I am not doing it the right way, but I am interested in how others are using Kedro IPython/notebooks for anything other than EDA.

Just to recap, this is the workflow that I adopted in the past for development.

  1. Run a partial pipeline and stop at the point of interest.
  2. Do whatever I need in a notebook environment, e.g. changing the definition of a node, or injecting or overwriting some of the data in the catalog.
  3. Continue to run the pipeline until I get my desired output.
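
The three steps above can be sketched with a toy pipeline. This is an illustration only and does not use Kedro: the `Node` tuple, the `run_nodes` helper, the dataset names, and the dict standing in for the catalog are all assumptions made for the example.

```python
from typing import Callable, Dict, List, NamedTuple


class Node(NamedTuple):
    """Hypothetical node: reads one catalog entry and writes another."""

    name: str
    func: Callable[[int], int]
    input: str
    output: str


def run_nodes(nodes: List[Node], catalog: Dict[str, int]) -> Dict[str, int]:
    """Run the given nodes in order, reading from and writing to the catalog."""
    for node in nodes:
        catalog[node.output] = node.func(catalog[node.input])
    return catalog


pipeline = [
    Node("clean", lambda x: x + 1, "raw", "cleaned"),
    Node("scale", lambda x: x * 10, "cleaned", "scaled"),
    Node("report", lambda x: x - 3, "scaled", "report"),
]

catalog = {"raw": 1}

# Step 1: run a partial pipeline and stop at the point of interest.
run_nodes(pipeline[:1], catalog)

# Step 2: inspect the intermediate data and overwrite it in the catalog.
catalog["cleaned"] = 5

# Step 3: continue the pipeline from the checkpoint to the final output.
run_nodes(pipeline[1:], catalog)
```

After the last step, `catalog["report"]` holds the result computed from the overwritten intermediate value rather than from the original raw input.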

@lorenabalan
Contributor

@AntonyMilneQB I think not being able to run anything in the jupyter notebook / ipython takes away a lot from jupyter users we're trying to convert to Python and Kedro. If we do that we need to seriously consider the consequences and clearly draw the boundaries of our target audience, because it sounds like they would be very different.

@antonymilne antonymilne added the Component: Jupyter/IPython Issue/PR relevant for Jupyter Notebooks, IPython sessions and the interactive workflow in Kedro label Apr 7, 2022
@antonymilne antonymilne removed the Component: Jupyter/IPython Issue/PR relevant for Jupyter Notebooks, IPython sessions and the interactive workflow in Kedro label Jun 8, 2022
@stichbury changed the title from "Technical design doc for KedroSession" to "Technical design decision record for KedroSession" on Jul 5, 2023
@merelcht
Member Author

Closing this as there's no immediate actions remaining for this issue.
