Merge branch 'main' into fix/clean-up-starters-logic

AhdraMeraliQB authored Jan 9, 2024
2 parents 3fe920b + ee793f2 commit 30ce26c
Showing 12 changed files with 109 additions and 16 deletions.
3 changes: 3 additions & 0 deletions RELEASE.md
@@ -6,11 +6,14 @@
* Removed example pipeline requirements when examples are not selected in `tools`.
* Allowed modern versions of JupyterLab and Jupyter Notebooks.
* Removed the `setuptools` dependency.
* Added `source_dir` explicitly in `pyproject.toml` for non-src layout projects.
* `MemoryDataset` entries are now included in free outputs.

## Breaking changes to the API
* Added logging about not using async mode in `SequentialRunner` and `ParallelRunner`.

## Documentation changes
* Added documentation about `bootstrap_project` and `configure_project`.

## Community contributions

1 change: 1 addition & 0 deletions docs/source/faq/faq.md
@@ -44,6 +44,7 @@ This is a growing set of technical FAQs. The [product FAQs on the Kedro website]
* [How to use global variables with the `OmegaConfigLoader`](../configuration/advanced_configuration.md#how-to-use-global-variables-with-the-omegaconfigloader)?
* [How do I use resolvers in the `OmegaConfigLoader`](../configuration/advanced_configuration.md#how-to-use-resolvers-in-the-omegaconfigloader)?
* [How do I load credentials through environment variables](../configuration/advanced_configuration.md#how-to-load-credentials-through-environment-variables)?
* [How do I use Kedro with a different project structure?](../kedro_project_setup/settings.md#use-kedro-without-the-src-folder)


## Nodes and pipelines
29 changes: 28 additions & 1 deletion docs/source/kedro_project_setup/session.md
@@ -36,4 +36,31 @@ You can provide the following optional arguments in `KedroSession.create()`:
- `save_on_close`: A boolean value to indicate whether or not to save the session to disk when it's closed
- `env`: Environment for the `KedroContext`
- `extra_params`: Optional dictionary containing extra project parameters
for the underlying `KedroContext`; if specified, this will update (and therefore take precedence over) parameters retrieved from the project configuration
for the underlying **`KedroContext`**; if specified, this will update (and therefore take precedence over) parameters retrieved from the project configuration
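
As an illustration only, here is a minimal sketch of passing these arguments. The environment name and the extra parameter are placeholders, and the project is assumed to have been set up with `bootstrap_project` or `configure_project` as described in the next section.

```python
from kedro.framework.session import KedroSession

# Assumes bootstrap_project()/configure_project() has already been called
# (see the next section) so that the project's settings are registered.
# project_path defaults to the current working directory when omitted.
with KedroSession.create(
    env="local",                        # environment for the KedroContext
    extra_params={"example_param": 1},  # hypothetical override of a project parameter
    save_on_close=False,                # do not save this session to disk on close
) as session:
    session.run()
```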

## `bootstrap_project` and `configure_project`
```{image} ../meta/images/kedro-session-creation.png
:alt: mermaid-General overview diagram for KedroSession creation
```

% Mermaid code, see https://github.com/kedro-org/kedro/wiki/Render-Mermaid-diagrams
% graph LR
% subgraph Kedro Startup Flowchart
% A[bootstrap_project] -->|Read pyproject.toml| B
% A -->|Add project root to sys.path| B[configure_project]
% C[Initialize KedroSession]
% B --> |Read settings.py| C
% B --> |Read pipeline_registry.py| C
% end

Both `bootstrap_project` and `configure_project` handle the setup of a Kedro project, but there is a subtle difference: `bootstrap_project` is used in project mode, when you work with the project's source code directly, while `configure_project` is used in packaged mode, when you run an installed, packaged project.

Kedro's CLI runs these functions at startup as part of `kedro run`, so in most cases you don't need to call them yourself. If you want to [interact with a Kedro project programmatically in an interactive session such as a Jupyter notebook](../notebooks_and_ipython/kedro_and_notebooks.md#reload_kedro-line-magic), use the `%reload_kedro` line magic with Jupyter or IPython. Only call these functions directly if none of these methods works.

### `bootstrap_project`

This function calls `configure_project` and, in addition, reads metadata from `pyproject.toml` and adds the project root to `sys.path` so that the project can be imported as a Python package. It is typically used when working directly with the source code of a Kedro project.
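
As a minimal sketch (the project path below is a placeholder), working with a project's source code programmatically might look like this:

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_path = Path("/path/to/my-kedro-project")  # hypothetical project root

# Reads metadata from pyproject.toml, updates sys.path and calls configure_project
metadata = bootstrap_project(project_path)
print(metadata.package_name)

with KedroSession.create(project_path=project_path) as session:
    session.run()
```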

### `configure_project`

This function reads `settings.py` and `pipeline_registry.py` and registers the configuration before Kedro's run starts. If you have a packaged Kedro project, you only need to run `configure_project` before executing your pipeline.
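
For a packaged project, a sketch along these lines (the package name `my_package` is a placeholder) is usually enough:

```python
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession

configure_project("my_package")  # registers settings.py and pipeline_registry.py

with KedroSession.create() as session:
    session.run()
```
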
11 changes: 11 additions & 0 deletions docs/source/kedro_project_setup/settings.md
@@ -31,6 +31,9 @@ Every Kedro project comes with a default pre-populated `pyproject.toml` file in
package_name = "package_name"
project_name = "project_name"
kedro_init_version = "kedro_version"
tools = ""
example_pipeline = "False"
source_dir = "src"
```

The `package_name` should be a [valid Python package name](https://peps.python.org/pep-0423/) and the `project_name` should be a human-readable name. They are both mandatory keys for your project.
@@ -40,3 +43,11 @@ this value should also be updated.
You can also use `pyproject.toml` to specify settings for functionality such as [micro-packaging](../nodes_and_pipelines/micro_packaging.md), and to store the settings for other tools you use in your project, such as [`pytest` for automated testing](../development/automated_testing.md).
Consult the documentation for each of those tools to check how to configure their settings in the `pyproject.toml` file.
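
As an illustrative sketch only (assuming Python 3.11+ for the standard-library `tomllib`), the `[tool.kedro]` table can be read like any other TOML data, for example to inspect the keys described above:

```python
import tomllib  # standard library from Python 3.11; use the "tomli" package on older versions
from pathlib import Path

with Path("pyproject.toml").open("rb") as f:
    pyproject = tomllib.load(f)

kedro_meta = pyproject["tool"]["kedro"]
print(kedro_meta["package_name"], kedro_meta["project_name"])
print(kedro_meta.get("source_dir", "src"))  # falls back to the default src layout
```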

### Use Kedro without the `src` folder
Kedro uses the `src` layout by default. To change this, for example to use a [flat layout](https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/#src-layout-vs-flat-layout), change `pyproject.toml` as follows.

```diff
- source_dir = "src"
+ source_dir = ""
```
[Binary image file (the KedroSession creation diagram) cannot be displayed in the diff view.]

@@ -37,6 +37,8 @@ project_name = "{{ cookiecutter.project_name }}"
kedro_init_version = "{{ cookiecutter.kedro_version }}"
tools = {{ cookiecutter.tools | default('') | string | replace('\"', '\\\"') }}
example_pipeline = "{{ cookiecutter.example_pipeline }}"
source_dir = "src"


[tool.pytest.ini_options]
addopts = """
18 changes: 9 additions & 9 deletions kedro/io/core.py
@@ -536,19 +536,19 @@ def _fetch_latest_load_version(self) -> str:
        # When load version is unpinned, fetch the most recent existing
        # version from the given path.
        pattern = str(self._get_versioned_path("*"))
-       version_paths = sorted(self._glob_function(pattern), reverse=True)
+       try:
+           version_paths = sorted(self._glob_function(pattern), reverse=True)
+       except Exception as exc:
+           message = (
+               f"Did not find any versions for {self}. This could be "
+               f"due to insufficient permission. Exception: {exc}"
+           )
+           raise VersionNotFoundError(message) from exc
        most_recent = next(
            (path for path in version_paths if self._exists_function(path)), None
        )
-       protocol = getattr(self, "_protocol", None)
        if not most_recent:
-           if protocol in CLOUD_PROTOCOLS:
-               message = (
-                   f"Did not find any versions for {self}. This could be "
-                   f"due to insufficient permission."
-               )
-           else:
-               message = f"Did not find any versions for {self}"
+           message = f"Did not find any versions for {self}"
            raise VersionNotFoundError(message)
        return PurePath(most_recent).parent.name

11 changes: 9 additions & 2 deletions kedro/runner/runner.py
@@ -94,9 +94,16 @@ def run(
f"Pipeline input(s) {unsatisfied} not found in the DataCatalog"
)

# Identify MemoryDataset in the catalog
memory_datasets = {
ds_name
for ds_name, ds in catalog._datasets.items()
if isinstance(ds, MemoryDataset)
}

# Check if there's any output datasets that aren't in the catalog and don't match a pattern
# in the catalog.
free_outputs = pipeline.outputs() - set(registered_ds)
# in the catalog and include MemoryDataset.
free_outputs = pipeline.outputs() - (set(registered_ds) - memory_datasets)

# Register the default dataset pattern with the catalog
catalog = catalog.shallow_copy(
@@ -39,6 +39,7 @@ project_name = "{{ cookiecutter.project_name }}"
kedro_init_version = "{{ cookiecutter.kedro_version }}"
tools = {{ cookiecutter.tools | default('') | string | replace('\"', '\\\"') }}
example_pipeline = "{{ cookiecutter.example_pipeline }}"
source_dir = "src"

[tool.pytest.ini_options]
addopts = """
5 changes: 2 additions & 3 deletions pyproject.toml
@@ -58,13 +58,12 @@ test = [
"blacken-docs==1.9.2",
"black~=22.0",
"coverage[toml]",
"fsspec<2023.9", # Temporary, newer version causing "test_no_versions_with_cloud_protocol" to fail
"import-linter==1.12.1",
"ipython>=7.31.1, <8.0; python_version < '3.8'",
"ipython~=8.10; python_version >= '3.8'",
"Jinja2<3.1.0",
"jupyterlab_server>=2.11.1",
"jupyterlab~=3.0",
"jupyterlab>=3,<5",
"jupyter~=1.0",
"kedro-datasets",
"moto==1.3.7; python_version < '3.10'",
@@ -85,7 +84,7 @@ test = [
]
docs = [
"docutils<0.18",
"sphinx~=5.3.0",
"sphinx>=5.3,<7.3",
"sphinx_rtd_theme==1.2.0",
# Regression on sphinx-autodoc-typehints 1.21
# that creates some problematic docstrings
10 changes: 10 additions & 0 deletions tests/runner/conftest.py
@@ -165,3 +165,13 @@ def two_branches_crossed_pipeline():
node(identity, "ds3_B", "ds4_B", name="node4_B"),
]
)


@pytest.fixture
def pipeline_with_memory_datasets():
return pipeline(
[
node(func=identity, inputs="Input1", outputs="MemOutput1", name="node1"),
node(func=identity, inputs="Input2", outputs="MemOutput2", name="node2"),
]
)
34 changes: 33 additions & 1 deletion tests/runner/test_sequential_runner.py
@@ -7,7 +7,13 @@
import pytest

from kedro.framework.hooks import _create_hook_manager
- from kedro.io import AbstractDataset, DataCatalog, DatasetError, LambdaDataset
+ from kedro.io import (
+     AbstractDataset,
+     DataCatalog,
+     DatasetError,
+     LambdaDataset,
+     MemoryDataset,
+ )
from kedro.pipeline import node
from kedro.pipeline.modular_pipeline import pipeline as modular_pipeline
from kedro.runner import SequentialRunner
Expand Down Expand Up @@ -279,3 +285,29 @@ def test_suggest_resume_scenario(
            hook_manager=_create_hook_manager(),
        )
        assert re.search(expected_pattern, caplog.text)


class TestMemoryDatasetBehaviour:
    def test_run_includes_memory_datasets(self, pipeline_with_memory_datasets):
        # Create a catalog with MemoryDataset entries and inputs for the pipeline
        catalog = DataCatalog(
            {
                "Input1": LambdaDataset(load=lambda: "data1", save=lambda data: None),
                "Input2": LambdaDataset(load=lambda: "data2", save=lambda data: None),
                "MemOutput1": MemoryDataset(),
                "MemOutput2": MemoryDataset(),
            }
        )

        # Add a regular dataset to the catalog
        catalog.add("RegularOutput", LambdaDataset(None, None, lambda: True))

        # Run the pipeline
        output = SequentialRunner().run(pipeline_with_memory_datasets, catalog)

        # Check that MemoryDataset outputs are included in the run results
        assert "MemOutput1" in output
        assert "MemOutput2" in output
        assert (
            "RegularOutput" not in output
        )  # This output is registered in DataCatalog and so should not be in free outputs
