Merge branch 'main' into fix/exists-method-for-shared-memory-dataset
Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
ankatiyar committed Sep 24, 2024
2 parents caa5b1d + 53280bd commit 045d130
Showing 47 changed files with 1,955 additions and 615 deletions.
2 changes: 2 additions & 0 deletions .github/styles/Kedro/ignore.txt
@@ -44,3 +44,5 @@ transcoding
transcode
Claypot
ethanknights
Aneira
Printify
59 changes: 59 additions & 0 deletions .github/workflows/benchmark-performance.yml
@@ -0,0 +1,59 @@
name: ASV Benchmark

on:
push:
branches:
- main # Run benchmarks on every commit to the main branch
workflow_dispatch:


jobs:

benchmark:
runs-on: ubuntu-latest

steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
path: "kedro"

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install asv # Install ASV
- name: Run ASV benchmarks
run: |
cd kedro
asv machine --machine=github-actions
asv run -v --machine=github-actions
- name: Set git email and name
run: |
git config --global user.email "kedro@kedro.com"
git config --global user.name "Kedro"
- name: Checkout target repository
uses: actions/checkout@v4
with:
repository: kedro-org/kedro-benchmark-results
token: ${{ secrets.GH_TAGGING_TOKEN }}
ref: 'main'
path: "kedro-benchmark-results"

- name: Copy files to target repository
run: |
cp -r /home/runner/work/kedro/kedro/kedro/.asv /home/runner/work/kedro/kedro/kedro-benchmark-results/
- name: Commit and Push changes to kedro-org/kedro-benchmark-results
run: |
cd kedro-benchmark-results
git add .
git commit -m "Add results"
git push
15 changes: 15 additions & 0 deletions RELEASE.md
@@ -1,11 +1,23 @@
# Upcoming Release

## Major features and improvements
* Implemented `KedroDataCatalog`, repeating `DataCatalog` functionality with a few API enhancements:
  * Removed `_FrozenDatasets`; datasets are now accessed as properties;
  * Added a feature to get a dataset by name;
  * Simplified `add_feed_dict()` and renamed it to `add_data()`;
  * Moved dataset initialisation out of the `from_config()` method and into the constructor.
* Moved development requirements from `requirements.txt` to a dedicated section in `pyproject.toml` in the project template.
* Implemented a `Protocol` abstraction for the current `DataCatalog` and for adding new catalog implementations.
* Refactored the `kedro run` and `kedro catalog` commands.
* Moved pattern resolution logic from `DataCatalog` to a separate component, `CatalogConfigResolver`, and updated `DataCatalog` to use it internally.
* Made packaged Kedro projects return the `session.run()` output so it can be used when running them in an interactive environment.
* Enhanced `OmegaConfigLoader` configuration validation to detect duplicate keys at all parameter levels, ensuring comprehensive nested key checking.
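
As a taste of the new catalog API described above, a minimal sketch. This is based only on the notes in this list: the import path and the `get_dataset()` name are assumptions, and only `add_data()` is named explicitly.

```python
# Sketch based only on the release notes above, not the released API.
from kedro.io import KedroDataCatalog, MemoryDataset  # import path assumed

catalog = KedroDataCatalog(datasets={"reviews": MemoryDataset()})

# The "get dataset by name" feature (method name assumed)
reviews = catalog.get_dataset("reviews")

# add_feed_dict() was simplified and renamed to add_data()
catalog.add_data({"parameters": {"test_size": 0.2}})
```
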
## Bug fixes and other changes
* Fixed a bug where using dataset factories would break with `ThreadRunner`.
* Fixed a bug where `SharedMemoryDataset.exists` would not call the underlying `MemoryDataset`.
* Fixed the example tests in the project template.
* Made credentials loading consistent between `KedroContext._get_catalog()` and `resolve_patterns` so that both use `_get_config_credentials()`.
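
The `SharedMemoryDataset.exists` fix this branch was opened for can be sketched as follows. The class body is simplified and partly hypothetical; only the idea of delegating the existence check to the wrapped `MemoryDataset` comes from the note above.

```python
# Simplified sketch of the fix. In Kedro the wrapped dataset is a
# MemoryDataset proxy living in a multiprocessing manager; a plain
# MemoryDataset stands in for it here.
from kedro.io import MemoryDataset


class SharedMemoryDatasetSketch:
    def __init__(self):
        self.shared_memory_dataset = MemoryDataset()

    def _exists(self) -> bool:
        # The fix: consult the underlying MemoryDataset instead of
        # answering without calling it.
        return self.shared_memory_dataset.exists()
```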

## Breaking changes to the API
* Removed `ShelveStore` to address a security vulnerability.
@@ -17,6 +29,9 @@
## Community contributions
* [Puneet](https://github.com/puneeter)
* [ethanknights](https://github.com/ethanknights)
* [Manezki](https://github.com/Manezki)
* [MigQ2](https://github.com/MigQ2)
* [Felix Scherz](https://github.com/felixscherz)

# Release 0.19.8

12 changes: 12 additions & 0 deletions asv.conf.json
@@ -0,0 +1,12 @@
{
"version": 1,
"project": "Kedro",
"project_url": "https://kedro.org/",
"repo": ".",
"install_command": ["pip install -e ."],
"branches": ["main"],
"environment_type": "virtualenv",
"show_commit_url": "http://github.com/kedro-org/kedro/commit/",
"results_dir": ".asv/results",
"html_dir": ".asv/html"
}
Empty file added benchmarks/__init__.py
16 changes: 16 additions & 0 deletions benchmarks/benchmark_dummy.py
@@ -0,0 +1,16 @@
# Write the benchmarking functions here.
# See "Writing benchmarks" in the asv docs for more information.


class TimeSuite:
"""
A dummy benchmark suite to test with asv framework.
"""
def setup(self):
self.d = {}
for x in range(500):
self.d[x] = None

def time_keys(self):
for key in self.d.keys():
pass
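
Beyond this dummy suite, ASV discovers benchmarks by name prefix (`time_*`, `mem_*`, `peakmem_*`) and supports parameterised runs. A hypothetical follow-up suite illustrating those conventions — not part of this commit:

```python
# Hypothetical example of common ASV conventions; not part of this commit.


class TimeParamSuite:
    params = [100, 1_000, 10_000]  # ASV runs each benchmark once per value
    param_names = ["size"]

    def setup(self, size):
        self.d = {i: None for i in range(size)}

    def time_iterate_keys(self, size):
        # ASV times the body of any method whose name starts with "time_"
        for _ in self.d:
            pass

    def mem_dict(self, size):
        # "mem_" methods report the memory footprint of the returned object
        return self.d
```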
6 changes: 6 additions & 0 deletions docs/source/conf.py
@@ -127,11 +127,14 @@
"typing.Type",
"typing.Set",
"kedro.config.config.ConfigLoader",
"kedro.io.catalog_config_resolver.CatalogConfigResolver",
"kedro.io.core.AbstractDataset",
"kedro.io.core.AbstractVersionedDataset",
"kedro.io.core.CatalogProtocol",
"kedro.io.core.DatasetError",
"kedro.io.core.Version",
"kedro.io.data_catalog.DataCatalog",
"kedro.io.kedro_data_catalog.KedroDataCatalog",
"kedro.io.memory_dataset.MemoryDataset",
"kedro.io.partitioned_dataset.PartitionedDataset",
"kedro.pipeline.pipeline.Pipeline",
@@ -168,6 +171,9 @@
"D[k] if k in D, else d. d defaults to None.",
"None. Update D from mapping/iterable E and F.",
"Patterns",
"CatalogConfigResolver",
"CatalogProtocol",
"KedroDataCatalog",
),
"py:data": (
"typing.Any",
4 changes: 2 additions & 2 deletions docs/source/contribution/technical_steering_committee.md
@@ -61,10 +61,10 @@ We look for commitment markers who can do the following:
| [Huong Nguyen](https://github.com/Huongg) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Ivan Danov](https://github.com/idanov) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Jitendra Gundaniya](https://github.com/jitu5) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Joel Schwarzmann](https://github.com/datajoely) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Joel Schwarzmann](https://github.com/datajoely) | [Aneira Health](https://www.aneira.health) |
| [Juan Luis Cano](https://github.com/astrojuanlu) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Laura Couto](https://github.com/lrcouto) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Marcin Zabłocki](https://github.com/marrrcin) | [Printify, Inc.](https://printify.com/) |
| [Merel Theisen](https://github.com/merelcht) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Nok Lam Chan](https://github.com/noklam) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
| [Rashida Kanchwala](https://github.com/rashidakanchwala) | [QuantumBlack, AI by McKinsey](https://www.mckinsey.com/capabilities/quantumblack) |
34 changes: 17 additions & 17 deletions docs/source/data/how_to_create_a_custom_dataset.md
@@ -4,7 +4,7 @@

## AbstractDataset

If you are a contributor and would like to submit a new dataset, you must extend the {py:class}`~kedro.io.AbstractDataset` interface or {py:class}`~kedro.io.AbstractVersionedDataset` interface if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation.
If you are a contributor and would like to submit a new dataset, you must extend the {py:class}`~kedro.io.AbstractDataset` interface, or the {py:class}`~kedro.io.AbstractVersionedDataset` interface if you plan to support versioning. It requires subclasses to implement the `load` and `save` methods, and it wraps them to provide uniform error handling. It also requires subclasses to override `_describe`, which is used to log internal information about instances of your custom `AbstractDataset` implementation.
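
A simplified sketch of that wrapping idea follows; this is illustrative only and is not the actual `kedro.io.core` implementation.

```python
# Illustrative sketch of uniform error handling around a subclass's load();
# NOT the actual kedro.io.core implementation.


class DatasetError(Exception):
    """Uniform error type raised by the wrapped methods."""


class AbstractDatasetSketch:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        raw_load = cls.__dict__.get("load")
        if raw_load is not None:

            def load(self):
                try:
                    return raw_load(self)
                except DatasetError:
                    raise  # already carries a uniform message
                except Exception as exc:
                    raise DatasetError(f"Failed while loading {self!r}") from exc

            cls.load = load


class FailingDataset(AbstractDatasetSketch):
    def load(self):
        raise ValueError("boom")


# FailingDataset().load() now raises DatasetError with a uniform message.
```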


## Scenario
@@ -31,8 +31,8 @@ Consult the [Pillow documentation](https://pillow.readthedocs.io/en/stable/insta

At the minimum, a valid Kedro dataset needs to subclass the base {py:class}`~kedro.io.AbstractDataset` and provide an implementation for the following abstract methods:

* `_load`
* `_save`
* `load`
* `save`
* `_describe`

`AbstractDataset` is generically typed with an input data type for saving data, and an output data type for loading data.
@@ -70,15 +70,15 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
"""
self._filepath = filepath

def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.
Returns:
Data from the image file as a numpy array.
"""
...

def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath"""
...

@@ -96,11 +96,11 @@ src/kedro_pokemon/datasets
└── image_dataset.py
```

## Implement the `_load` method with `fsspec`
## Implement the `load` method with `fsspec`

Many of the built-in Kedro datasets rely on [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) as a consistent interface to different data sources, as described earlier in the section about the [Data Catalog](../data/data_catalog.md#dataset-filepath). In this example, it's particularly convenient to use `fsspec` in conjunction with `Pillow` to read image data, since it allows the dataset to work flexibly with different image locations and formats.

Here is the implementation of the `_load` method using `fsspec` and `Pillow` to read the data of a single image into a `numpy` array:
Here is the implementation of the `load` method using `fsspec` and `Pillow` to read the data of a single image into a `numpy` array:

<details>
<summary><b>Click to expand</b></summary>
@@ -130,7 +130,7 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
self._filepath = PurePosixPath(path)
self._fs = fsspec.filesystem(self._protocol)

def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.
Returns:
@@ -168,14 +168,14 @@ In [2]: from PIL import Image
In [3]: Image.fromarray(image).show()
```

## Implement the `_save` method with `fsspec`
## Implement the `save` method with `fsspec`

Similarly, we can implement the `save` method as follows:


```python
class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath."""
# using get_filepath_str ensures that the protocol and path are appended correctly for different filesystems
save_path = get_filepath_str(self._filepath, self._protocol)
@@ -243,7 +243,7 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
self._filepath = PurePosixPath(path)
self._fs = fsspec.filesystem(self._protocol)

def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.
Returns:
@@ -254,7 +254,7 @@ class ImageDataset(AbstractDataset[np.ndarray, np.ndarray]):
image = Image.open(f).convert("RGBA")
return np.asarray(image)

def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath."""
save_path = get_filepath_str(self._filepath, self._protocol)
with self._fs.open(save_path, mode="wb") as f:
@@ -312,7 +312,7 @@ To add versioning support to the new dataset we need to extend the
{py:class}`~kedro.io.AbstractVersionedDataset` to:

* Accept a `version` keyword argument as part of the constructor
* Adapt the `_load` and `_save` method to use the versioned data path obtained from `_get_load_path` and `_get_save_path` respectively
* Adapt the `load` and `save` methods to use the versioned data path obtained from `_get_load_path` and `_get_save_path` respectively

The following amends the full implementation of our basic `ImageDataset`. It now loads and saves data to and from a versioned subfolder (`data/01_raw/pokemon-images-and-types/images/images/pikachu.png/<version>/pikachu.png` with `version` being a datetime-formatted string `YYYY-MM-DDThh.mm.ss.sssZ` by default):

@@ -359,7 +359,7 @@ class ImageDataset(AbstractVersionedDataset[np.ndarray, np.ndarray]):
glob_function=self._fs.glob,
)

def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.
Returns:
@@ -370,7 +370,7 @@ class ImageDataset(AbstractVersionedDataset[np.ndarray, np.ndarray]):
image = Image.open(f).convert("RGBA")
return np.asarray(image)

def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath."""
save_path = get_filepath_str(self._get_save_path(), self._protocol)
with self._fs.open(save_path, mode="wb") as f:
@@ -435,7 +435,7 @@ The difference between the original `ImageDataset` and the versioned `ImageDatas
+ glob_function=self._fs.glob,
+ )
+
def _load(self) -> np.ndarray:
def load(self) -> np.ndarray:
"""Loads data from the image file.

Returns:
@@ -447,7 +447,7 @@ The difference between the original `ImageDataset` and the versioned `ImageDatas
image = Image.open(f).convert("RGBA")
return np.asarray(image)

def _save(self, data: np.ndarray) -> None:
def save(self, data: np.ndarray) -> None:
"""Saves image data to the specified filepath."""
- save_path = get_filepath_str(self._filepath, self._protocol)
+ save_path = get_filepath_str(self._get_save_path(), self._protocol)
29 changes: 22 additions & 7 deletions docs/source/development/automated_testing.md
@@ -19,21 +19,36 @@ There are many testing frameworks available for Python. One of the most popular

Let's look at how you can start working with `pytest` in your Kedro project.

### Prerequisite: Install your Kedro project
### Install test requirements
Before installing any test requirements, it is important to ensure you have installed your project locally. This allows you to test different parts of your project by importing them into your test files.


To install your project including all the project-specific dependencies and test requirements:
1. Add the following section to the `pyproject.toml` file located in the project root:
```toml
[project.optional-dependencies]
dev = [
"pytest-cov",
"pytest-mock",
"pytest",
]
```

2. Navigate to the root directory of the project and run:
```bash
pip install ."[dev]"
```

Before getting started with `pytest`, it is important to ensure you have installed your project locally. This allows you to test different parts of your project by importing them into your test files.
Alternatively, you can individually install test requirements as you would install other packages with `pip`, making sure you have installed your project locally and your [project's virtual environment is active](../get_started/install.md#create-a-virtual-environment-for-your-kedro-project).

To install your project, navigate to your project root and run the following command:
1. To install your project, navigate to your project root and run the following command:

```bash
pip install -e .
```

>**NOTE**: The option `-e` installs an editable version of your project, allowing you to make changes to the project files without needing to re-install them each time.
### Install `pytest`

Install `pytest` as you would install other packages with `pip`, making sure your [project's virtual environment is active](../get_started/install.md#create-a-virtual-environment-for-your-kedro-project).
2. Install test requirements one by one:
```bash
pip install pytest
```
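
Once everything is installed, a minimal first test could look like the following. This is a hypothetical sketch: `kedro_pokemon` is a placeholder for your own package name, and it assumes the default `pipeline_registry.py` project layout.

```python
# tests/test_run.py — a minimal, hypothetical first test; replace
# "kedro_pokemon" with your own package name before running.
from kedro_pokemon.pipeline_registry import register_pipelines  # placeholder import


def test_project_has_a_default_pipeline():
    pipelines = register_pipelines()
    assert "__default__" in pipelines
    assert pipelines["__default__"].nodes  # expect at least one node
```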
12 changes: 6 additions & 6 deletions docs/source/development/linting.md
@@ -18,17 +18,17 @@ There are a variety of Python tools available to use with your Kedro projects. T
type.

### Install the tools
Install `ruff` by adding the following lines to your project's `requirements.txt`
file:
```text
ruff # Used for linting, formatting and sorting module imports
To install `ruff`, add the following section to the `pyproject.toml` file located in the project root:
```toml
[project.optional-dependencies]
dev = ["ruff"]
```

To install all the project-specific dependencies, including the linting tools, navigate to the root directory of the
Then, to install your project including all the project-specific dependencies and the linting tools, navigate to the root directory of the
project and run:

```bash
pip install -r requirements.txt
pip install ."[dev]"
```

Alternatively, you can individually install the linting tools using the following shell commands:
9 changes: 8 additions & 1 deletion docs/source/nodes_and_pipelines/slice_a_pipeline.md
@@ -1,6 +1,13 @@
# Slice a pipeline

Sometimes it is desirable to run a subset, or a 'slice' of a pipeline's nodes. In this page, we illustrate the programmatic options that Kedro provides. You can also use the [Kedro CLI to pass parameters to `kedro run`](../development/commands_reference.md#run-the-project) command and slice a pipeline.
Sometimes it is desirable to run a subset, or a 'slice', of a pipeline's nodes. There are two primary ways to achieve this:


1. **Visually through Kedro-Viz:** This approach allows you to visually choose and slice pipeline nodes, which then generates a run command for executing the slice within your Kedro project. Detailed steps on how to achieve this are available in the Kedro-Viz documentation: [Slice a Pipeline](https://docs.kedro.org/projects/kedro-viz/en/stable/slice_a_pipeline.html).

![](../meta/images/slice_pipeline_kedro_viz.gif)

2. **Programmatically with the Kedro CLI.** You can also [pass parameters to the `kedro run` command](../development/commands_reference.md#run-the-project) to slice a pipeline. The rest of this page illustrates the programmatic options that Kedro provides; a brief sketch follows below.
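
As a preview, here is a brief, hypothetical sketch of the kind of slicing methods this page covers (`Pipeline.from_nodes`, `to_nodes`, and `from_inputs`); the node functions and names are placeholders.

```python
from kedro.pipeline import Pipeline, node


def mean(xs, n):  # placeholder function for illustration
    return sum(xs) / n


pipeline = Pipeline(
    [
        node(len, "xs", "n", name="len_node"),
        node(mean, ["xs", "n"], "m", name="mean_node"),
    ]
)

downstream = pipeline.from_nodes("len_node")  # "len_node" and everything after it
upstream = pipeline.to_nodes("mean_node")     # "mean_node" and everything before it
dependent = pipeline.from_inputs("xs")        # nodes that consume dataset "xs"
```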

Let's look again at the example pipeline from the [pipeline introduction documentation](./pipeline_introduction.md#how-to-build-a-pipeline), which computes the variance of a set of numbers:

Expand Down
3 changes: 2 additions & 1 deletion features/load_node.feature
@@ -5,5 +5,6 @@ Feature: load_node in new project
And I have run a non-interactive kedro new with starter "default"

Scenario: Execute ipython load_node magic
When I execute the load_node magic command
When I install project and its dev dependencies
And I execute the load_node magic command
Then the logs should show that load_node executed successfully