Merge branch 'main' into fix/exists-method-for-shared-memory-dataset
Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
ankatiyar committed Sep 24, 2024
2 parents caa5b1d + 53280bd commit 045d130
Showing 47 changed files with 1,955 additions and 615 deletions.
2 changes: 2 additions & 0 deletions .github/styles/Kedro/ignore.txt
@@ -44,3 +44,5 @@ transcoding
transcode
Claypot
ethanknights
Aneira
Printify
59 changes: 59 additions & 0 deletions .github/workflows/benchmark-performance.yml
@@ -0,0 +1,59 @@
name: ASV Benchmark

on:
push:
branches:
- main # Run benchmarks on every commit to the main branch
workflow_dispatch:


jobs:

benchmark:
runs-on: ubuntu-latest

steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
path: "kedro"

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install asv # Install ASV
- name: Run ASV benchmarks
run: |
cd kedro
asv machine --machine=github-actions
asv run -v --machine=github-actions
- name: Set git email and name
run: |
git config --global user.email "kedro@kedro.com"
git config --global user.name "Kedro"
- name: Checkout target repository
uses: actions/checkout@v4
with:
repository: kedro-org/kedro-benchmark-results
token: ${{ secrets.GH_TAGGING_TOKEN }}
ref: 'main'
path: "kedro-benchmark-results"

- name: Copy files to target repository
run: |
cp -r /home/runner/work/kedro/kedro/kedro/.asv /home/runner/work/kedro/kedro/kedro-benchmark-results/
- name: Commit and Push changes to kedro-org/kedro-benchmark-results
run: |
cd kedro-benchmark-results
git add .
git commit -m "Add results"
git push
15 changes: 15 additions & 0 deletions RELEASE.md
@@ -1,11 +1,23 @@
# Upcoming Release

## Major features and improvements
* Implemented `KedroDataCatalog`, repeating `DataCatalog` functionality with a few API enhancements:
  * Removed `_FrozenDatasets`; datasets are now accessed as properties;
  * Added a feature to get a dataset by name;
  * Simplified `add_feed_dict()` and renamed it to `add_data()`;
  * Moved dataset initialisation out of the `from_config()` method and into the constructor.
* Moved development requirements from `requirements.txt` to a dedicated section in `pyproject.toml` in the project template.
* Implemented a `Protocol` abstraction for the current `DataCatalog` and for adding new catalog implementations.
* Refactored the `kedro run` and `kedro catalog` commands.
* Moved pattern resolution logic from `DataCatalog` to a separate component, `CatalogConfigResolver`, and updated `DataCatalog` to use it internally.
* Made packaged Kedro projects return the `session.run()` output so it can be used when running them in an interactive environment.
* Enhanced `OmegaConfigLoader` configuration validation to detect duplicate keys at all parameter levels, ensuring comprehensive nested key checking.
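
As a taste of the new catalog API described above, a minimal sketch. This is based only on the notes in this list: the import path and the `get_dataset()` name are assumptions, and only `add_data()` is named explicitly.

```python
# Sketch based only on the release notes above, not the released API.
from kedro.io import KedroDataCatalog, MemoryDataset  # import path assumed

catalog = KedroDataCatalog(datasets={"reviews": MemoryDataset()})

# The "get dataset by name" feature (method name assumed)
reviews = catalog.get_dataset("reviews")

# add_feed_dict() was simplified and renamed to add_data()
catalog.add_data({"parameters": {"test_size": 0.2}})
```
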
## Bug fixes and other changes
* Fixed a bug where using dataset factories would break with `ThreadRunner`.
* Fixed a bug where `SharedMemoryDataset.exists` would not call the underlying `MemoryDataset`.
* Fixed the example tests in the project template.
* Made credentials loading consistent between `KedroContext._get_catalog()` and `resolve_patterns` so that both use `_get_config_credentials()`.
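
The `SharedMemoryDataset.exists` fix this branch was opened for can be sketched as follows. The class body is simplified and partly hypothetical; only the idea of delegating the existence check to the wrapped `MemoryDataset` comes from the note above.

```python
# Simplified sketch of the fix. In Kedro the wrapped dataset is a
# MemoryDataset proxy living in a multiprocessing manager; a plain
# MemoryDataset stands in for it here.
from kedro.io import MemoryDataset


class SharedMemoryDatasetSketch:
    def __init__(self):
        self.shared_memory_dataset = MemoryDataset()

    def _exists(self) -> bool:
        # The fix: consult the underlying MemoryDataset instead of
        # answering without calling it.
        return self.shared_memory_dataset.exists()
```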

## Breaking changes to the API
* Removed `ShelveStore` to address a security vulnerability.
@@ -17,6 +29,9 @@
## Community contributions
* [Puneet](https://github.com/puneeter)
* [ethanknights](https://github.com/ethanknights)
* [Manezki](https://github.com/Manezki)
* [MigQ2](https://github.com/MigQ2)
* [Felix Scherz](https://github.com/felixscherz)

# Release 0.19.8

12 changes: 12 additions & 0 deletions asv.conf.json
@@ -0,0 +1,12 @@
{
"version": 1,
"project": "Kedro",
"project_url": "https://kedro.org/",
"repo": ".",
"install_command": ["pip install -e ."],
"branches": ["main"],
"environment_type": "virtualenv",
"show_commit_url": "http://github.com/kedro-org/kedro/commit/",
"results_dir": ".asv/results",
"html_dir": ".asv/html"
}
Empty file added benchmarks/__init__.py
16 changes: 16 additions & 0 deletions benchmarks/benchmark_dummy.py
@@ -0,0 +1,16 @@
# Write the benchmarking functions here.
# See "Writing benchmarks" in the asv docs for more information.


class TimeSuite:
"""
A dummy benchmark suite to test with asv framework.
"""
def setup(self):
self.d = {}
for x in range(500):
self.d[x] = None

def time_keys(self):
for key in self.d.keys():
pass
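
Beyond this dummy suite, ASV discovers benchmarks by name prefix (`time_*`, `mem_*`, `peakmem_*`) and supports parameterised runs. A hypothetical follow-up suite illustrating those conventions — not part of this commit:

```python
# Hypothetical example of common ASV conventions; not part of this commit.


class TimeParamSuite:
    params = [100, 1_000, 10_000]  # ASV runs each benchmark once per value
    param_names = ["size"]

    def setup(self, size):
        self.d = {i: None for i in range(size)}

    def time_iterate_keys(self, size):
        # ASV times the body of any method whose name starts with "time_"
        for _ in self.d:
            pass

    def mem_dict(self, size):
        # "mem_" methods report the memory footprint of the returned object
        return self.d
```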
6 changes: 6 additions & 0 deletions docs/source/conf.py
@@ -127,11 +127,14 @@
"typing.Type",
"typing.Set",
"kedro.config.config.ConfigLoader",
"kedro.io.catalog_config_resolver.CatalogConfigResolver",
"kedro.io.core.AbstractDataset",
"kedro.io.core.AbstractVersionedDataset",
"kedro.io.core.CatalogProtocol",
"kedro.io.core.DatasetError",
"kedro.io.core.Version",
"kedro.io.data_catalog.DataCatalog",
"kedro.io.kedro_data_catalog.KedroDataCatalog",
"kedro.io.memory_dataset.MemoryDataset",
"kedro.io.partitioned_dataset.PartitionedDataset",
"kedro.pipeline.pipeline.Pipeline",
@@ -168,6 +171,9 @@
"D[k] if k in D, else d. d defaults to None.",
"None. Update D from mapping/iterable E and F.",
"Patterns",
"CatalogConfigResolver",
"CatalogProtocol",
"KedroDataCatalog",
),
"py:data": (
"typing.Any",
4 changes: 2 additions & 2 deletions docs/source/contribution/technical_steering_committee.md
@@ -61,10 +61,10 @@ We look for commitment markers who can do the following:
| [Huong Nguyen](https://github.com/Huongg) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Ivan Danov](https://github.com/idanov) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Jitendra Gundaniya](https://github.com/jitu5) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Joel Schwarzmann](https://github.com/datajoely) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Joel Schwarzmann](https://github.com/datajoely) | [Aneira Health](https://www.aneira.health) |
| [Juan Luis Cano](https://github.com/astrojuanlu) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Laura Couto](https://github.com/lrcouto) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Marcin Zabłocki](https://github.com/marrrcin) | [Printify, Inc.](https://printify.com/) |
| [Merel Theisen](https://github.com/merelcht) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Nok Lam Chan](https://github.com/noklam) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Rashida Kanchwala](https://github.com/rashidakanchwala) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
34 changes: 17 additions & 17 deletions docs/source/data/how_to_create_a_custom_dataset.md
@@ -4,7 +4,7 @@

## AbstractDataset

If you are a contributor and would like to submit a new dataset, you must extend the {py:class}`~kedro.io.AbstractDataset` interface or {py:class}`~kedro.io.AbstractVersionedDataset` interface if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation.
If you are a contributor and would like to submit a new dataset, you must extend the {py:class}`~kedro.io.AbstractDataset` interface, or the {py:class}`~kedro.io.AbstractVersionedDataset` interface if you plan to support versioning. It requires subclasses to implement the `load` and `save` methods, and it wraps them to provide uniform error handling. It also requires subclasses to override `_describe`, which is used to log internal information about instances of your custom `AbstractDataset` implementation.
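
A simplified sketch of that wrapping idea follows; this is illustrative only and is not the actual `kedro.io.core` implementation.

```python
# Illustrative sketch of uniform error handling around a subclass's load();
# NOT the actual kedro.io.core implementation.


class DatasetError(Exception):
    """Uniform error type raised by the wrapped methods."""


class AbstractDatasetSketch:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        raw_load = cls.__dict__.get("load")
        if raw_load is not None:

            def load(self):
                try:
                    return raw_load(self)
                except DatasetError:
                    raise  # already carries a uniform message
                except Exception as exc:
                    raise DatasetError(f"Failed while loading {self!r}") from exc

            cls.load = load


class FailingDataset(AbstractDatasetSketch):
    def load(self):
        raise ValueError("boom")


# FailingDataset().load() now raises DatasetError with a uniform message.
```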


## Scenario
@@ -31,8 +31,8 @@ Consult the [Pillow documentation](https://pillow.readthedocs.io/en/stable/insta

At the minimum, a valid Kedro dataset needs to subclass the base {py:class}`~kedro.io.AbstractDataset` and provide an implementation for the following abstract methods:

* `_load`
* `_save`
* `load`
* `save`
* `_describe`

`AbstractDataset` is generically typed with an input data type for saving data, and an output data type for loading data.
@@ -70,15 +70,15 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
"""
self._filepath = filepath

def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.
Returns:
Data from the image file as a numpy array.
"""
...

def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath"""
...

@@ -96,11 +96,11 @@ src/kedro_pokemon/datasets
└── image_dataset.py
```

## Implement the `_load` method with `fsspec`
## Implement the `load` method with `fsspec`

Many of the built-in Kedro datasets rely on [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) as a consistent interface to different data sources, as described earlier in the section about the [Data Catalog](../data/data_catalog.md#dataset-filepath). In this example, it's particularly convenient to use `fsspec` in conjunction with `Pillow` to read image data, since it allows the dataset to work flexibly with different image locations and formats.

Here is the implementation of the `_load` method using `fsspec` and `Pillow` to read the data of a single image into a `numpy` array:
Here is the implementation of the `load` method using `fsspec` and `Pillow` to read the data of a single image into a `numpy` array:

<details>
<summary><b>Click to expand</b></summary>
@@ -130,7 +130,7 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
self._filepath = PurePosixPath(path)
self._fs = fsspec.filesystem(self._protocol)

def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.
Returns:
@@ -168,14 +168,14 @@ In [2]: from PIL import Image
In [3]: Image.fromarray(image).show()
```

## Implement the `_save` method with `fsspec`
## Implement the `save` method with `fsspec`

Similarly, we can implement the `save` method as follows:


```python
class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath."""
# using get_filepath_str ensures that the protocol and path are appended correctly for different filesystems
save_path = get_filepath_str(self._filepath, self._protocol)
@@ -243,7 +243,7 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
self._filepath = PurePosixPath(path)
self._fs = fsspec.filesystem(self._protocol)

def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.
Returns:
@@ -254,7 +254,7 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
image = Image.open(f).convert("RGBA")
return np.asarray(image)

def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath."""
save_path = get_filepath_str(self._filepath, self._protocol)
with self._fs.open(save_path, mode="wb") as f:
@@ -312,7 +312,7 @@ To add versioning support to the new dataset we need to extend the
{py:class}`~kedro.io.AbstractVersionedDataset` to:

* Accept a `version` keyword argument as part of the constructor
* Adapt the `_load` and `_save` method to use the versioned data path obtained from `_get_load_path` and `_get_save_path` respectively
* Adapt the `load` and `save` methods to use the versioned data path obtained from `_get_load_path` and `_get_save_path` respectively

The following amends the full implementation of our basic `ImageDataset`. It now loads and saves data to and from a versioned subfolder (`data/01_raw/pokemon-images-and-types/images/images/pikachu.png/<version>/pikachu.png` with `version` being a datetime-formatted string `YYYY-MM-DDThh.mm.ss.sssZ` by default):

@@ -359,7 +359,7 @@ class ImageDataset(AbstractVersionedDataset[np.ndarray, np.ndarray]):
glob_function=self._fs.glob,
)

def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.
Returns:
@@ -370,7 +370,7 @@ class ImageDataset(AbstractVersionedDataset[np.ndarray, np.ndarray]):
image = Image.open(f).convert("RGBA")
return np.asarray(image)

def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath."""
save_path = get_filepath_str(self._get_save_path(), self._protocol)
with self._fs.open(save_path, mode="wb") as f:
@@ -435,7 +435,7 @@ The difference between the original `ImageDataset` and the versioned `ImageDatas
+ glob_function=self._fs.glob,
+ )
+
def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.

Returns:
@@ -447,7 +447,7 @@ The difference between the original `ImageDataset` and the versioned `ImageDatas
image = Image.open(f).convert("RGBA")
return np.asarray(image)

def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath."""
- save_path = get_filepath_str(self._filepath, self._protocol)
+ save_path = get_filepath_str(self._get_save_path(), self._protocol)
29 changes: 22 additions & 7 deletions docs/source/development/automated_testing.md
@@ -19,21 +19,36 @@ There are many testing frameworks available for Python. One of the most popular

Let's look at how you can start working with `pytest` in your Kedro project.

### Prerequisite: Install your Kedro project
### Install test requirements
Before installing any test requirements, it is important to ensure you have installed your project locally. This allows you to test different parts of your project by importing them into your test files.


To install your project including all the project-specific dependencies and test requirements:
1. Add the following section to the `pyproject.toml` file located in the project root:
```toml
[project.optional-dependencies]
dev = [
"pytest-cov",
"pytest-mock",
"pytest",
]
```

2. Navigate to the root directory of the project and run:
```bash
pip install ."[dev]"
```

Before getting started with `pytest`, it is important to ensure you have installed your project locally. This allows you to test different parts of your project by importing them into your test files.
Alternatively, you can individually install test requirements as you would install other packages with `pip`, making sure you have installed your project locally and your [project's virtual environment is active](../get_started/install.md#create-a-virtual-environment-for-your-kedro-project).

To install your project, navigate to your project root and run the following command:
1. To install your project, navigate to your project root and run the following command:

```bash
pip install -e .
```

>**NOTE**: The option `-e` installs an editable version of your project, allowing you to make changes to the project files without needing to re-install them each time.
### Install `pytest`

Install `pytest` as you would install other packages with `pip`, making sure your [project's virtual environment is active](../get_started/install.md#create-a-virtual-environment-for-your-kedro-project).
2. Install test requirements one by one:
```bash
pip install pytest
```
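
Once everything is installed, a minimal first test could look like the following. This is a hypothetical sketch: `kedro_pokemon` is a placeholder for your own package name, and it assumes the default `pipeline_registry.py` project layout.

```python
# tests/test_run.py — a minimal, hypothetical first test; replace
# "kedro_pokemon" with your own package name before running.
from kedro_pokemon.pipeline_registry import register_pipelines  # placeholder import


def test_project_has_a_default_pipeline():
    pipelines = register_pipelines()
    assert "__default__" in pipelines
    assert pipelines["__default__"].nodes  # expect at least one node
```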
12 changes: 6 additions & 6 deletions docs/source/development/linting.md
@@ -18,17 +18,17 @@ There are a variety of Python tools available to use with your Kedro projects. T
type.

### Install the tools
Install `ruff` by adding the following lines to your project's `requirements.txt`
file:
```text
ruff # Used for linting, formatting and sorting module imports
To install `ruff`, add the following section to the `pyproject.toml` file located in the project root:
```toml
[project.optional-dependencies]
dev = ["ruff"]
```

To install all the project-specific dependencies, including the linting tools, navigate to the root directory of the
Then, to install your project including all the project-specific dependencies and the linting tools, navigate to the root directory of the
project and run:

```bash
pip install -r requirements.txt
pip install ."[dev]"
```

Alternatively, you can individually install the linting tools using the following shell commands:
9 changes: 8 additions & 1 deletion docs/source/nodes_and_pipelines/slice_a_pipeline.md
@@ -1,6 +1,13 @@
# Slice a pipeline

Sometimes it is desirable to run a subset, or a 'slice' of a pipeline's nodes. In this page, we illustrate the programmatic options that Kedro provides. You can also use the [Kedro CLI to pass parameters to `kedro run`](../development/commands_reference.md#run-the-project) command and slice a pipeline.
Sometimes it is desirable to run a subset, or a 'slice', of a pipeline's nodes. There are two primary ways to achieve this:


1. **Visually through Kedro-Viz:** This approach allows you to visually choose and slice pipeline nodes, which then generates a run command for executing the slice within your Kedro project. Detailed steps on how to achieve this are available in the Kedro-Viz documentation: [Slice a Pipeline](https://docs.kedro.org/projects/kedro-viz/en/stable/slice_a_pipeline.html).

![](../meta/images/slice_pipeline_kedro_viz.gif)

2. **Programmatically with the Kedro CLI.** You can also [pass parameters to the `kedro run` command](../development/commands_reference.md#run-the-project) to slice a pipeline. The rest of this page illustrates the programmatic options that Kedro provides; a brief sketch follows below.
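
As a preview, here is a brief, hypothetical sketch of the kind of slicing methods this page covers (`Pipeline.from_nodes`, `to_nodes`, and `from_inputs`); the node functions and names are placeholders.

```python
from kedro.pipeline import Pipeline, node


def mean(xs, n):  # placeholder function for illustration
    return sum(xs) / n


pipeline = Pipeline(
    [
        node(len, "xs", "n", name="len_node"),
        node(mean, ["xs", "n"], "m", name="mean_node"),
    ]
)

downstream = pipeline.from_nodes("len_node")  # "len_node" and everything after it
upstream = pipeline.to_nodes("mean_node")     # "mean_node" and everything before it
dependent = pipeline.from_inputs("xs")        # nodes that consume dataset "xs"
```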

Let's look again at the example pipeline from the [pipeline introduction documentation](./pipeline_introduction.md#how-to-build-a-pipeline), which computes the variance of a set of numbers:

Expand Down
3 changes: 2 additions & 1 deletion features/load_node.feature
@@ -5,5 +5,6 @@ Feature: load_node in new project
And I have run a non-interactive kedro new with starter "default"

Scenario: Execute ipython load_node magic
When I execute the load_node magic command
When I install project and its dev dependencies
And I execute the load_node magic command
Then the logs should show that load_node executed successfully