♻️ Rename DataSet to Dataset for import from kedro.io.core (#462)
Galileo-Galilei authored Oct 21, 2023
1 parent f7baa8a commit 8e0de86
Showing 30 changed files with 162 additions and 166 deletions.
6 changes: 3 additions & 3 deletions CHANGELOG.md
@@ -163,8 +163,8 @@

### Fixed

-- :bug: Force the input dataset in `KedroPipelineModel` to be a `MemoryDataSet` to remove unnecessary dependency to the underlying Kedro `AbstractDataSet` used during training ([#273](https://github.com/Galileo-Galilei/kedro-mlflow/issues/273))
-- :bug: Make `MlflowArtifactDataset` correctly log in mlflow Kedro DataSets without a `_path` attribute like `kedro.io.PartitionedDataSet` ([#258](https://github.com/Galileo-Galilei/kedro-mlflow/issues/258)).
+- :bug: Force the input dataset in `KedroPipelineModel` to be a `MemoryDataset` to remove unnecessary dependency to the underlying Kedro `AbstractDataset` used during training ([#273](https://github.com/Galileo-Galilei/kedro-mlflow/issues/273))
+- :bug: Make `MlflowArtifactDataset` correctly log in mlflow Kedro DataSets without a `_path` attribute like `kedro.io.PartitionedDataset` ([#258](https://github.com/Galileo-Galilei/kedro-mlflow/issues/258)).
- :bug: Automatically persist pipeline parameters when calling the `kedro mlflow modelify` command for consistency with how `PipelineML` objects are handled and for ease of use ([#282](https://github.com/Galileo-Galilei/kedro-mlflow/issues/282)).

## [0.8.0] - 2022-01-05
@@ -356,7 +356,7 @@
### Fixed

- :bug: Versioned datasets artifacts logging are handled correctly ([#41](https://github.com/Galileo-Galilei/kedro-mlflow/issues/41))
-- :bug: MlflowDataSet handles correctly datasets which are inherited from AbstractDataSet ([#45](https://github.com/Galileo-Galilei/kedro-mlflow/issues/45))
+- :bug: MlflowDataSet handles correctly datasets which are inherited from AbstractDataset ([#45](https://github.com/Galileo-Galilei/kedro-mlflow/issues/45))
- :zap: Change the test in `_generate_kedro_command` to accept both empty `Iterable`s(default in CLI mode) and `None` values (default in interactive mode) ([#50](https://github.com/Galileo-Galilei/kedro-mlflow/issues/50))
- :zap: Force to close all mlflow runs when a pipeline fails. It prevents further execution of the pipeline to be logged within the same mlflow run_id as the failing pipeline. ([#10](https://github.com/Galileo-Galilei/kedro-mlflow/issues/10))
- :memo: Fix various documentation typos ([#34](https://github.com/Galileo-Galilei/kedro-mlflow/pull/34), [#35](https://github.com/Galileo-Galilei/kedro-mlflow/pull/35), [#36](https://github.com/Galileo-Galilei/kedro-mlflow/pull/36) and more)
8 changes: 4 additions & 4 deletions docs/source/01_introduction/01_introduction.md
@@ -23,11 +23,11 @@ While ``Kedro`` and ``Mlflow`` do not compete in the same field, they provide so

| Functionality | Kedro | Mlflow |
| :----------------------------- | :------------------------------------------------ | :---------------------------------------------------------------------------------- |
-| I/O abstraction | various ``AbstractDataSet`` | N/A |
+| I/O abstraction | various ``AbstractDataset`` | N/A |
| I/O configuration files | - ``catalog.yml`` <br> - ``parameters.yml`` | ``MLproject`` |
| Compute abstraction | - ``Pipeline`` <br> - ``Node`` | N/A |
| Compute configuration files | - ``hooks.py`` <br> - ``run.py`` | `MLproject` |
-| Parameters and data versioning | - ``Journal`` <br> - ``AbstractVersionedDataSet`` | - ``log_metric``<br> - ``log_artifact``<br> - ``log_param`` |
+| Parameters and data versioning | - ``Journal`` <br> - ``AbstractVersionedDataset`` | - ``log_metric``<br> - ``log_artifact``<br> - ``log_param`` |
| Cli execution | command ``kedro run`` | command ``mlflow run`` |
| Code packaging | command ``kedro package`` | N/A |
| Model packaging | N/A | - ``Mlflow Models`` (``mlflow.XXX.log_model`` functions) <br> - ``Mlflow Flavours`` |
@@ -40,7 +40,7 @@ We discuss hereafter how the two libraries compete on the different functionalit
``Mlflow`` and ``Kedro`` essentially overlap in the way they each offer dedicated configuration files for running the pipeline from the CLI. However:

- ``Mlflow`` provides a single configuration file (the ``MLProject``) where all elements are declared (data, parameters and pipelines). Its goal is mainly to enable CLI execution of the project, but it is not very flexible. In my opinion, this file is **production oriented** and is not really intended to use for exploration.
-- ``Kedro`` offers a bunch of files (``catalog.yml``, ``parameters.yml``, ``pipeline.py``) and their associated abstraction (``AbstractDataSet``, ``DataCatalog``, ``Pipeline`` and ``node`` objects). ``Kedro`` is much more opinionated: each object has a dedicated place (and only one!) in the template. This makes the framework both **exploration and production oriented**. The downside is that it could make the learning curve a bit sharper since a newcomer has to learn all ``Kedro`` specifications. It also provides a ``kedro-viz`` plugin to visualize the DAG interactively, which is particularly handy in medium-to-big projects.
+- ``Kedro`` offers a bunch of files (``catalog.yml``, ``parameters.yml``, ``pipeline.py``) and their associated abstraction (``AbstractDataset``, ``DataCatalog``, ``Pipeline`` and ``node`` objects). ``Kedro`` is much more opinionated: each object has a dedicated place (and only one!) in the template. This makes the framework both **exploration and production oriented**. The downside is that it could make the learning curve a bit sharper since a newcomer has to learn all ``Kedro`` specifications. It also provides a ``kedro-viz`` plugin to visualize the DAG interactively, which is particularly handy in medium-to-big projects.


> **``Kedro`` is a clear winner here, since it provides more functionalities than ``Mlflow``. It handles very well _by design_ the exploration phase of data science projects, whereas Mlflow is less flexible.**
@@ -52,7 +52,7 @@ We discuss hereafter how the two libraries compete on the different functionalit
The ``Kedro`` ``Journal`` aimed at reproducibility (it was removed in ``kedro==0.18``), but is not focused on machine learning. The `Journal` keeps track of two elements:

- the CLI arguments, including *on the fly* parameters. This makes the command used to run the pipeline fully reproducible.
-- the ``AbstractVersionedDataSet`` for which versioning is activated. It consists in copying the data whose ``versioned`` argument is ``True`` when the ``save`` method of the ``AbstractVersionedDataSet`` is called.
+- the ``AbstractVersionedDataset`` for which versioning is activated. It consists in copying the data whose ``versioned`` argument is ``True`` when the ``save`` method of the ``AbstractVersionedDataset`` is called.
This approach suffers from two main drawbacks:
- the configuration is assumed immutable (including parameters), which is not realistic in machine learning projects where parameters are very volatile. To fix this, the ``git sha`` was recently added to the ``Journal``, but it still has some bugs in my experience (including the fact that the current ``git sha`` is logged even if the pipeline is run with uncommitted changes, which prevents reproducibility). This is still recent and will likely evolve in the future.
- there is no support for browsing old runs, which prevents [cleaning the database of old and unused datasets](https://github.com/quantumblacklabs/kedro/issues/406), comparing runs with each other, and so on.
30 changes: 15 additions & 15 deletions docs/source/03_getting_started/02_first_steps.md
@@ -34,26 +34,26 @@ If the pipeline executes properly, you should see the following log:

```console
2020-07-13 21:29:25,401 - kedro.io.data_catalog - INFO - Loading data from `example_iris_data` (CSVDataset)...
-2020-07-13 21:29:25,562 - kedro.io.data_catalog - INFO - Loading data from `params:example_test_data_ratio` (MemoryDataSet)...
+2020-07-13 21:29:25,562 - kedro.io.data_catalog - INFO - Loading data from `params:example_test_data_ratio` (MemoryDataset)...
2020-07-13 21:29:25,969 - kedro.pipeline.node - INFO - Running node: split_data([example_iris_data,params:example_test_data_ratio]) -> [example_test_x,example_test_y,example_train_x,example_train_y]
-2020-07-13 21:29:26,053 - kedro.io.data_catalog - INFO - Saving data to `example_train_x` (MemoryDataSet)...
-2020-07-13 21:29:26,368 - kedro.io.data_catalog - INFO - Saving data to `example_train_y` (MemoryDataSet)...
-2020-07-13 21:29:26,484 - kedro.io.data_catalog - INFO - Saving data to `example_test_x` (MemoryDataSet)...
-2020-07-13 21:29:26,486 - kedro.io.data_catalog - INFO - Saving data to `example_test_y` (MemoryDataSet)...
+2020-07-13 21:29:26,053 - kedro.io.data_catalog - INFO - Saving data to `example_train_x` (MemoryDataset)...
+2020-07-13 21:29:26,368 - kedro.io.data_catalog - INFO - Saving data to `example_train_y` (MemoryDataset)...
+2020-07-13 21:29:26,484 - kedro.io.data_catalog - INFO - Saving data to `example_test_x` (MemoryDataset)...
+2020-07-13 21:29:26,486 - kedro.io.data_catalog - INFO - Saving data to `example_test_y` (MemoryDataset)...
2020-07-13 21:29:26,610 - kedro.runner.sequential_runner - INFO - Completed 1 out of 4 tasks
-2020-07-13 21:29:26,850 - kedro.io.data_catalog - INFO - Loading data from `example_train_x` (MemoryDataSet)...
-2020-07-13 21:29:26,851 - kedro.io.data_catalog - INFO - Loading data from `example_train_y` (MemoryDataSet)...
-2020-07-13 21:29:26,965 - kedro.io.data_catalog - INFO - Loading data from `parameters` (MemoryDataSet)...
+2020-07-13 21:29:26,850 - kedro.io.data_catalog - INFO - Loading data from `example_train_x` (MemoryDataset)...
+2020-07-13 21:29:26,851 - kedro.io.data_catalog - INFO - Loading data from `example_train_y` (MemoryDataset)...
+2020-07-13 21:29:26,965 - kedro.io.data_catalog - INFO - Loading data from `parameters` (MemoryDataset)...
2020-07-13 21:29:26,972 - kedro.pipeline.node - INFO - Running node: train_model([example_train_x,example_train_y,parameters]) -> [example_model]
-2020-07-13 21:29:27,756 - kedro.io.data_catalog - INFO - Saving data to `example_model` (MemoryDataSet)...
+2020-07-13 21:29:27,756 - kedro.io.data_catalog - INFO - Saving data to `example_model` (MemoryDataset)...
2020-07-13 21:29:27,763 - kedro.runner.sequential_runner - INFO - Completed 2 out of 4 tasks
-2020-07-13 21:29:28,141 - kedro.io.data_catalog - INFO - Loading data from `example_model` (MemoryDataSet)...
-2020-07-13 21:29:28,161 - kedro.io.data_catalog - INFO - Loading data from `example_test_x` (MemoryDataSet)...
+2020-07-13 21:29:28,141 - kedro.io.data_catalog - INFO - Loading data from `example_model` (MemoryDataset)...
+2020-07-13 21:29:28,161 - kedro.io.data_catalog - INFO - Loading data from `example_test_x` (MemoryDataset)...
2020-07-13 21:29:28,670 - kedro.pipeline.node - INFO - Running node: predict([example_model,example_test_x]) -> [example_predictions]
-2020-07-13 21:29:29,002 - kedro.io.data_catalog - INFO - Saving data to `example_predictions` (MemoryDataSet)...
+2020-07-13 21:29:29,002 - kedro.io.data_catalog - INFO - Saving data to `example_predictions` (MemoryDataset)...
2020-07-13 21:29:29,248 - kedro.runner.sequential_runner - INFO - Completed 3 out of 4 tasks
-2020-07-13 21:29:29,433 - kedro.io.data_catalog - INFO - Loading data from `example_predictions` (MemoryDataSet)...
-2020-07-13 21:29:29,730 - kedro.io.data_catalog - INFO - Loading data from `example_test_y` (MemoryDataSet)...
+2020-07-13 21:29:29,433 - kedro.io.data_catalog - INFO - Loading data from `example_predictions` (MemoryDataset)...
+2020-07-13 21:29:29,730 - kedro.io.data_catalog - INFO - Loading data from `example_test_y` (MemoryDataset)...
2020-07-13 21:29:29,911 - kedro.pipeline.node - INFO - Running node: report_accuracy([example_predictions,example_test_y]) -> None
2020-07-13 21:29:30,056 - km_example.pipelines.data_science.nodes - INFO - Model accuracy on test set: 100.00%
2020-07-13 21:29:30,214 - kedro.runner.sequential_runner - INFO - Completed 4 out of 4 tasks
@@ -142,7 +142,7 @@ example_iris_data:
example_model:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  data_set:
-    type: pickle.PickleDataset
+    type: pickle.PickleDataSet
    filepath: data/06_models/trained_model.pkl
```
@@ -14,9 +14,9 @@ Artifacts are a very flexible and convenient way to "bind" any data type to your

## How to version data in a kedro project?

-``kedro-mlflow`` introduces a new ``AbstractDataSet`` called ``MlflowArtifactDataset``. It is a wrapper for any ``AbstractDataSet`` which decorates the underlying dataset ``save`` method and logs the file automatically in mlflow as an artifact each time the ``save`` method is called.
+``kedro-mlflow`` introduces a new ``AbstractDataset`` called ``MlflowArtifactDataset``. It is a wrapper for any ``AbstractDataset`` which decorates the underlying dataset ``save`` method and logs the file automatically in mlflow as an artifact each time the ``save`` method is called.

-Since it is an ``AbstractDataSet``, it can be used with the YAML API. Assume that you have the following entry in the ``catalog.yml``:
+Since it is an ``AbstractDataset``, it can be used with the YAML API. Assume that you have the following entry in the ``catalog.yml``:

```yaml
my_dataset_to_version:
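  # the rest of this entry is collapsed in the diff view; a hedged sketch of the
  # full entry, assuming the pandas.CSVDataSet wrapping used elsewhere in these docs:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  data_set:
    type: pandas.CSVDataSet
    filepath: /path/to/a/destination/file.csv
```
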
@@ -57,7 +57,7 @@ my_dataset_to_version:

### Can I use the ``MlflowArtifactDataset`` in interactive mode?

-Like all Kedro ``AbstractDataSet``, ``MlflowArtifactDataset`` is callable in the python API:
+Like all Kedro ``AbstractDataset``, ``MlflowArtifactDataset`` is callable in the python API:

```python
from kedro_mlflow.io.artifacts import MlflowArtifactDataset
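# the rest of this snippet is collapsed in the diff view; a hedged sketch of typical
# interactive usage, assuming kedro_datasets.pandas.CSVDataSet and a local filepath:
import pandas as pd
from kedro_datasets.pandas import CSVDataSet

csv_dataset = MlflowArtifactDataset(
    data_set={"type": CSVDataSet, "filepath": "data/08_reporting/example.csv"}
)
# saving writes the csv locally and, inside an active mlflow run, logs it as an artifact
csv_dataset.save(data=pd.DataFrame({"a": [1, 2], "b": [3, 4]}))
```
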
12 changes: 6 additions & 6 deletions docs/source/04_experimentation_tracking/05_version_metrics.md
@@ -6,14 +6,14 @@ MLflow defines a metric as "a (key, value) pair, where the value is numeric". Ea

## How to version metrics in a kedro project?

-`kedro-mlflow` introduces 3 ``AbstractDataSet`` to manage metrics:
+`kedro-mlflow` introduces 3 ``AbstractDataset`` to manage metrics:
- ``MlflowMetricDataset`` which can log a float as a metric
- ``MlflowMetricHistoryDataset`` which can log the evolution over time of a given metric, e.g. a list or a dict of floats.
- ``MlflowMetricsDataset``, a wrapper around a dictionary of metrics returned by a node, which it logs in MLflow.

### Saving a single float as a metric with ``MlflowMetricDataset``

-The ``MlflowMetricDataset`` is an ``AbstractDataSet`` which enables saving or loading a ``float`` as an mlflow metric. You must specify the ``key`` (i.e. the name to display in mlflow) when creating the dataset. Some examples follow:
+The ``MlflowMetricDataset`` is an ``AbstractDataset`` which enables saving or loading a ``float`` as an mlflow metric. You must specify the ``key`` (i.e. the name to display in mlflow) when creating the dataset. Some examples follow:

- The most basic usage is to create the dataset and save a value:

@@ -60,7 +60,7 @@ with mlflow.start_run():
my_metric = metric_ds.load() # value=0.1 (step number 1)
```
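
The collapsed part of this example creates the dataset and saves the value; a minimal sketch of the basic usage mentioned above (the key name and value are illustrative):

```python
import mlflow

from kedro_mlflow.io.metrics import MlflowMetricDataset

metric_ds = MlflowMetricDataset(key="my_metric")
with mlflow.start_run():
    metric_ds.save(0.3)  # logs metric "my_metric" with value 0.3 in the active run
```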

-Since it is an ``AbstractDataSet``, it can be used with the YAML API in your ``catalog.yml``, e.g.:
+Since it is an ``AbstractDataset``, it can be used with the YAML API in your ``catalog.yml``, e.g.:

```yaml
my_model_metric:
@@ -76,7 +76,7 @@ my_model_metric:
### Saving the evolution of a metric during training with ``MlflowMetricHistoryDataset``
-The ``MlflowMetricHistoryDataset`` is an ``AbstractDataSet`` which enables saving or loading the evolution of a metric in various formats. You must specify the ``key`` (i.e. the name to display in mlflow) when creating the dataset. Some examples follow:
+The ``MlflowMetricHistoryDataset`` is an ``AbstractDataset`` which enables saving or loading the evolution of a metric in various formats. You must specify the ``key`` (i.e. the name to display in mlflow) when creating the dataset. Some examples follow:
It enables logging either:
- a list of floats as a metric with incremental steps, e.g. ``[0.1,0.2,0.3]`` with ``mode=list`` for either ``save_args`` or ``load_args``
@@ -135,7 +135,7 @@ with mlflow.start_run():
metric_history_ds.load() # return [0.1,0.2,0.3]
```
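
A hedged sketch of the list-mode logging described above (key and values are illustrative):

```python
import mlflow

from kedro_mlflow.io.metrics import MlflowMetricHistoryDataset

metric_history_ds = MlflowMetricHistoryDataset(
    key="accuracy", save_args={"mode": "list"}
)
with mlflow.start_run():
    metric_history_ds.save([0.1, 0.2, 0.3])  # logged as steps 0, 1 and 2
```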

-As usual, since it is an ``AbstractDataSet``, it can be used with the YAML API in your ``catalog.yml``, and in this case, the ``key`` argument is optional:
+As usual, since it is an ``AbstractDataset``, it can be used with the YAML API in your ``catalog.yml``, and in this case, the ``key`` argument is optional:

```yaml
my_model_metric:
@@ -150,7 +150,7 @@
### Saving several metrics with their entire history with ``MlflowMetricsDataset``
-Since it is an ``AbstractDataSet``, it can be used with the YAML API. You can define it in your ``catalog.yml`` as:
+Since it is an ``AbstractDataset``, it can be used with the YAML API. You can define it in your ``catalog.yml`` as:
```yaml
my_model_metrics:
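  # the rest of this entry is collapsed in the diff view; a hedged sketch of a typical entry:
  type: kedro_mlflow.io.metrics.MlflowMetricsDataset
```

A node bound to this catalog entry returns a dictionary of metrics; a hedged sketch of the expected format (metric names and values are illustrative):

```python
def evaluate_model() -> dict:
    # each key is a metric name; each value holds the metric value and its step
    return {
        "accuracy": {"value": 0.97, "step": 0},
        "f1_score": {"value": 0.93, "step": 0},
    }
```
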
2 changes: 1 addition & 1 deletion docs/source/05_pipeline_serving/01_mlflow_models.md
@@ -16,7 +16,7 @@ Mlflow enable to create custom models "flavors" to convert any object to a Mlflo
## Pre-requisite for serving a pipeline

You can log any Kedro ``Pipeline`` matching the following requirements:
-- one of its inputs must be a ``pandas.DataFrame``, a ``spark.DataFrame`` or a ``numpy.array``. This is the **input which contains the data to predict on**. This can be any Kedro ``AbstractDataSet`` which loads data in one of the previous three formats. It can also be a ``MemoryDataSet`` and not be persisted in the ``catalog.yml``.
+- one of its inputs must be a ``pandas.DataFrame``, a ``spark.DataFrame`` or a ``numpy.array``. This is the **input which contains the data to predict on**. This can be any Kedro ``AbstractDataset`` which loads data in one of the previous three formats. It can also be a ``MemoryDataset`` and not be persisted in the ``catalog.yml``.
- all its other inputs must be persisted on disk (e.g. the machine learning model must already be trained and saved so that it can be exported).

*Note: if the pipeline has parameters, they will be persisted before exporting the model, which implies that you will not be able to modify them at runtime. This is a limitation of ``mlflow``.*
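
For illustration, a minimal sketch of a pipeline meeting these requirements (all names are hypothetical):

```python
from kedro.pipeline import Pipeline, node


def predict(model, data):
    # "data" is the inference input: a DataFrame loaded by any AbstractDataset,
    # or a MemoryDataset that is not even declared in the catalog.yml
    return model.predict(data)


# "trained_model" must be persisted on disk in the catalog.yml before the model is logged
inference_pipeline = Pipeline(
    [node(predict, inputs=["trained_model", "instances"], outputs="predictions")]
)
```
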
4 changes: 2 additions & 2 deletions docs/source/07_python_objects/01_DataSets.md
@@ -2,7 +2,7 @@

## ``MlflowArtifactDataset``

-``MlflowArtifactDataset`` is a wrapper for any ``AbstractDataSet`` which logs the dataset automatically in mlflow as an artifact when its ``save`` method is called. It can be used both with the YAML API:
+``MlflowArtifactDataset`` is a wrapper for any ``AbstractDataset`` which logs the dataset automatically in mlflow as an artifact when its ``save`` method is called. It can be used both with the YAML API:

```yaml
my_dataset_to_version:
@@ -138,7 +138,7 @@ mlflow_model_logger = MlflowModelSaverDataSet(
mlflow_model_logger.save(LinearRegression().fit(data))
```

-The same arguments are available, plus an additional [`version` common to usual `AbstractVersionedDataSet`](https://kedro.readthedocs.io/en/stable/kedro.io.AbstractVersionedDataSet.html)
+The same arguments are available, plus an additional [`version` common to usual `AbstractVersionedDataset`](https://kedro.readthedocs.io/en/stable/kedro.io.AbstractVersionedDataset.html)

```python
mlflow_model_logger = MlflowModelSaverDataSet(
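    # the remaining arguments are collapsed in the diff view; a hedged sketch with an
    # explicit version, assuming the sklearn flavor, a local filepath, and
    # `from kedro.io.core import Version`:
    flavor="mlflow.sklearn",
    filepath="data/06_models/linear_regression",
    version=Version(load="2023-10-21T10.05.00.000Z", save=None),
)
my_model = mlflow_model_logger.load()  # reloads the snapshot saved under that version
```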