Merge branch 'main' into fix/clean-up-starters-logic

AhdraMeraliQB authored Jan 9, 2024
2 parents 3fe920b + ee793f2 commit 30ce26c
Showing 12 changed files with 109 additions and 16 deletions.
3 changes: 3 additions & 0 deletions RELEASE.md
@@ -6,11 +6,14 @@
* Removed example pipeline requirements when examples are not selected in `tools`.
* Allowed modern versions of JupyterLab and Jupyter Notebooks.
* Removed the `setuptools` dependency.
* Added `source_dir` explicitly in `pyproject.toml` for non-src layout projects.
* `MemoryDataset` entries are now included in free outputs.

## Breaking changes to the API
* Added logging about not using async mode in `SequentialRunner` and `ParallelRunner`.

## Documentation changes
* Added documentation about `bootstrap_project` and `configure_project`.

## Community contributions

1 change: 1 addition & 0 deletions docs/source/faq/faq.md
@@ -44,6 +44,7 @@ This is a growing set of technical FAQs. The [product FAQs on the Kedro website]
* [How to use global variables with the `OmegaConfigLoader`](../configuration/advanced_configuration.md#how-to-use-global-variables-with-the-omegaconfigloader)?
* [How do I use resolvers in the `OmegaConfigLoader`](../configuration/advanced_configuration.md#how-to-use-resolvers-in-the-omegaconfigloader)?
* [How do I load credentials through environment variables](../configuration/advanced_configuration.md#how-to-load-credentials-through-environment-variables)?
* [How do I use Kedro with a different project structure?](../kedro_project_setup/settings.md#use-kedro-without-the-src-folder)


## Nodes and pipelines
29 changes: 28 additions & 1 deletion docs/source/kedro_project_setup/session.md
@@ -36,4 +36,31 @@ You can provide the following optional arguments in `KedroSession.create()`:
- `save_on_close`: A boolean value to indicate whether or not to save the session to disk when it's closed
- `env`: Environment for the `KedroContext`
- `extra_params`: Optional dictionary containing extra project parameters
for the underlying `KedroContext`; if specified, this will update (and therefore take precedence over) parameters retrieved from the project configuration
for the underlying **`KedroContext`**; if specified, this will update (and therefore take precedence over) parameters retrieved from the project configuration
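
As an illustration only, here is a minimal sketch of passing these arguments. The environment name and the extra parameter are placeholders, and the project is assumed to have been set up with `bootstrap_project` or `configure_project` as described in the next section.

```python
from kedro.framework.session import KedroSession

# Assumes bootstrap_project()/configure_project() has already been called
# (see the next section) so that the project's settings are registered.
# project_path defaults to the current working directory when omitted.
with KedroSession.create(
    env="local",                        # environment for the KedroContext
    extra_params={"example_param": 1},  # hypothetical override of a project parameter
    save_on_close=False,                # do not save this session to disk on close
) as session:
    session.run()
```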

## `bootstrap_project` and `configure_project`
```{image} ../meta/images/kedro-session-creation.png
:alt: mermaid-General overview diagram for KedroSession creation
```

% Mermaid code, see https://github.com/kedro-org/kedro/wiki/Render-Mermaid-diagrams
% graph LR
% subgraph Kedro Startup Flowchart
% A[bootstrap_project] -->|Read pyproject.toml| B
% A -->|Add project root to sys.path| B[configure_project]
% C[Initialize KedroSession]
% B --> |Read settings.py| C
% B --> |Read pipeline_registry.py| C
% end

Both `bootstrap_project` and `configure_project` handle the setup of a Kedro project, but there is a subtle difference: `bootstrap_project` is used in project mode, when you work with the project's source code directly, while `configure_project` is used in packaged mode, when you run an installed, packaged project.

Kedro's CLI runs these functions at startup as part of `kedro run`, so in most cases you don't need to call them yourself. If you want to [interact with a Kedro project programmatically in an interactive session such as a Jupyter notebook](../notebooks_and_ipython/kedro_and_notebooks.md#reload_kedro-line-magic), use the `%reload_kedro` line magic with Jupyter or IPython. Only call these functions directly if none of these methods works.

### `bootstrap_project`

This function calls `configure_project` and, in addition, reads metadata from `pyproject.toml` and adds the project root to `sys.path` so that the project can be imported as a Python package. It is typically used when working directly with the source code of a Kedro project.
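
As a minimal sketch (the project path below is a placeholder), working with a project's source code programmatically might look like this:

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_path = Path("/path/to/my-kedro-project")  # hypothetical project root

# Reads metadata from pyproject.toml, updates sys.path and calls configure_project
metadata = bootstrap_project(project_path)
print(metadata.package_name)

with KedroSession.create(project_path=project_path) as session:
    session.run()
```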

### `configure_project`

This function reads `settings.py` and `pipeline_registry.py` and registers the configuration before Kedro's run starts. If you have a packaged Kedro project, you only need to run `configure_project` before executing your pipeline.
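
For a packaged project, a sketch along these lines (the package name `my_package` is a placeholder) is usually enough:

```python
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession

configure_project("my_package")  # registers settings.py and pipeline_registry.py

with KedroSession.create() as session:
    session.run()
```
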
11 changes: 11 additions & 0 deletions docs/source/kedro_project_setup/settings.md
@@ -31,6 +31,9 @@ Every Kedro project comes with a default pre-populated `pyproject.toml` file in
package_name = "package_name"
project_name = "project_name"
kedro_init_version = "kedro_version"
tools = ""
example_pipeline = "False"
source_dir = "src"
```

The `package_name` should be a [valid Python package name](https://peps.python.org/pep-0423/) and the `project_name` should be a human-readable name. They are both mandatory keys for your project.
@@ -40,3 +43,11 @@ this value should also be updated.
You can also use `pyproject.toml` to specify settings for functionality such as [micro-packaging](../nodes_and_pipelines/micro_packaging.md), and to store the settings for other tools you use in your project, such as [`pytest` for automated testing](../development/automated_testing.md).
Consult the documentation for each of those tools to check how to configure their settings in the `pyproject.toml` file.
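
As an illustrative sketch only (assuming Python 3.11+ for the standard-library `tomllib`), the `[tool.kedro]` table can be read like any other TOML data, for example to inspect the keys described above:

```python
import tomllib  # standard library from Python 3.11; use the "tomli" package on older versions
from pathlib import Path

with Path("pyproject.toml").open("rb") as f:
    pyproject = tomllib.load(f)

kedro_meta = pyproject["tool"]["kedro"]
print(kedro_meta["package_name"], kedro_meta["project_name"])
print(kedro_meta.get("source_dir", "src"))  # falls back to the default src layout
```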

### Use Kedro without the `src` folder
Kedro uses the `src` layout by default. To change this, for example to use a [flat layout](https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/#src-layout-vs-flat-layout), change `pyproject.toml` as follows.

```diff
- source_dir = "src"
+ source_dir = ""
```
[Binary image file (the KedroSession creation diagram) cannot be displayed in the diff view.]

@@ -37,6 +37,8 @@ project_name = "{{ cookiecutter.project_name }}"
kedro_init_version = "{{ cookiecutter.kedro_version }}"
tools = {{ cookiecutter.tools | default('') | string | replace('\"', '\\\"') }}
example_pipeline = "{{ cookiecutter.example_pipeline }}"
source_dir = "src"


[tool.pytest.ini_options]
addopts = """
18 changes: 9 additions & 9 deletions kedro/io/core.py
@@ -536,19 +536,19 @@ def _fetch_latest_load_version(self) -> str:
        # When load version is unpinned, fetch the most recent existing
        # version from the given path.
        pattern = str(self._get_versioned_path("*"))
-       version_paths = sorted(self._glob_function(pattern), reverse=True)
+       try:
+           version_paths = sorted(self._glob_function(pattern), reverse=True)
+       except Exception as exc:
+           message = (
+               f"Did not find any versions for {self}. This could be "
+               f"due to insufficient permission. Exception: {exc}"
+           )
+           raise VersionNotFoundError(message) from exc
        most_recent = next(
            (path for path in version_paths if self._exists_function(path)), None
        )
-       protocol = getattr(self, "_protocol", None)
        if not most_recent:
-           if protocol in CLOUD_PROTOCOLS:
-               message = (
-                   f"Did not find any versions for {self}. This could be "
-                   f"due to insufficient permission."
-               )
-           else:
-               message = f"Did not find any versions for {self}"
+           message = f"Did not find any versions for {self}"
            raise VersionNotFoundError(message)
        return PurePath(most_recent).parent.name

11 changes: 9 additions & 2 deletions kedro/runner/runner.py
@@ -94,9 +94,16 @@ def run(
f"Pipeline input(s) {unsatisfied} not found in the DataCatalog"
)

# Identify MemoryDataset in the catalog
memory_datasets = {
ds_name
for ds_name, ds in catalog._datasets.items()
if isinstance(ds, MemoryDataset)
}

# Check if there's any output datasets that aren't in the catalog and don't match a pattern
# in the catalog.
free_outputs = pipeline.outputs() - set(registered_ds)
# in the catalog and include MemoryDataset.
free_outputs = pipeline.outputs() - (set(registered_ds) - memory_datasets)

# Register the default dataset pattern with the catalog
catalog = catalog.shallow_copy(
@@ -39,6 +39,7 @@ project_name = "{{ cookiecutter.project_name }}"
kedro_init_version = "{{ cookiecutter.kedro_version }}"
tools = {{ cookiecutter.tools | default('') | string | replace('\"', '\\\"') }}
example_pipeline = "{{ cookiecutter.example_pipeline }}"
source_dir = "src"

[tool.pytest.ini_options]
addopts = """
5 changes: 2 additions & 3 deletions pyproject.toml
@@ -58,13 +58,12 @@ test = [
"blacken-docs==1.9.2",
"black~=22.0",
"coverage[toml]",
"fsspec<2023.9", # Temporary, newer version causing "test_no_versions_with_cloud_protocol" to fail
"import-linter==1.12.1",
"ipython>=7.31.1, <8.0; python_version < '3.8'",
"ipython~=8.10; python_version >= '3.8'",
"Jinja2<3.1.0",
"jupyterlab_server>=2.11.1",
"jupyterlab~=3.0",
"jupyterlab>=3,<5",
"jupyter~=1.0",
"kedro-datasets",
"moto==1.3.7; python_version < '3.10'",
@@ -85,7 +84,7 @@ test = [
]
docs = [
"docutils<0.18",
"sphinx~=5.3.0",
"sphinx>=5.3,<7.3",
"sphinx_rtd_theme==1.2.0",
# Regression on sphinx-autodoc-typehints 1.21
# that creates some problematic docstrings
10 changes: 10 additions & 0 deletions tests/runner/conftest.py
@@ -165,3 +165,13 @@ def two_branches_crossed_pipeline():
node(identity, "ds3_B", "ds4_B", name="node4_B"),
]
)


@pytest.fixture
def pipeline_with_memory_datasets():
return pipeline(
[
node(func=identity, inputs="Input1", outputs="MemOutput1", name="node1"),
node(func=identity, inputs="Input2", outputs="MemOutput2", name="node2"),
]
)
34 changes: 33 additions & 1 deletion tests/runner/test_sequential_runner.py
@@ -7,7 +7,13 @@
import pytest

from kedro.framework.hooks import _create_hook_manager
- from kedro.io import AbstractDataset, DataCatalog, DatasetError, LambdaDataset
+ from kedro.io import (
+     AbstractDataset,
+     DataCatalog,
+     DatasetError,
+     LambdaDataset,
+     MemoryDataset,
+ )
from kedro.pipeline import node
from kedro.pipeline.modular_pipeline import pipeline as modular_pipeline
from kedro.runner import SequentialRunner
Expand Down Expand Up @@ -279,3 +285,29 @@ def test_suggest_resume_scenario(
            hook_manager=_create_hook_manager(),
        )
        assert re.search(expected_pattern, caplog.text)


class TestMemoryDatasetBehaviour:
    def test_run_includes_memory_datasets(self, pipeline_with_memory_datasets):
        # Create a catalog with MemoryDataset entries and inputs for the pipeline
        catalog = DataCatalog(
            {
                "Input1": LambdaDataset(load=lambda: "data1", save=lambda data: None),
                "Input2": LambdaDataset(load=lambda: "data2", save=lambda data: None),
                "MemOutput1": MemoryDataset(),
                "MemOutput2": MemoryDataset(),
            }
        )

        # Add a regular dataset to the catalog
        catalog.add("RegularOutput", LambdaDataset(None, None, lambda: True))

        # Run the pipeline
        output = SequentialRunner().run(pipeline_with_memory_datasets, catalog)

        # Check that MemoryDataset outputs are included in the run results
        assert "MemOutput1" in output
        assert "MemOutput2" in output
        assert (
            "RegularOutput" not in output
        )  # This output is registered in DataCatalog and so should not be in free outputs
