Describe integration with MLflow #3856

Merged
merged 9 commits into from
Jun 6, 2024
3 changes: 3 additions & 0 deletions docs/source/conf.py
@@ -72,6 +72,8 @@
"kedro-datasets": ("https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-2.0.0/", None),
"cpython": ("https://docs.python.org/3.8/", None),
"ipython": ("https://ipython.readthedocs.io/en/8.21.0/", None),
"mlflow": ("https://www.mlflow.org/docs/2.12.1/", None),
"kedro-mlflow": ("https://kedro-mlflow.readthedocs.io/en/0.12.2/", None),
}

# The suffix(es) of source filenames.
@@ -521,3 +523,4 @@ def setup(app):
user_agent = "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"

myst_heading_anchors = 5
myst_enable_extensions = ["colon_fence"]
1 change: 1 addition & 0 deletions docs/source/configuration/advanced_configuration.md
@@ -194,6 +194,7 @@ my_param: "${globals: nonexistent_global, 23}"
If there are duplicate keys in the globals files in your base and runtime environments, the values in the runtime environment
overwrite the values in your base environment.

(runtime-params)=
### How to override configuration with runtime parameters with the `OmegaConfigLoader`

Kedro allows you to [specify runtime parameters for the `kedro run` command with the `--params` CLI option](parameters.md#how-to-specify-parameters-at-runtime). These runtime parameters
8 changes: 7 additions & 1 deletion docs/source/index.rst
@@ -82,6 +82,13 @@ Welcome to Kedro's award-winning documentation!
nodes_and_pipelines/index.md
configuration/telemetry.md

.. toctree::
:maxdepth: 2
:caption: Integrations

integrations/pyspark_integration.md
integrations/mlflow.md

.. toctree::
:maxdepth: 2
:caption: Advanced usage
@@ -90,7 +97,6 @@ Welcome to Kedro's award-winning documentation!
extend_kedro/index.md
hooks/index.md
logging/index.md
integrations/pyspark_integration.md
development/index.md
deployment/index.md

311 changes: 311 additions & 0 deletions docs/source/integrations/mlflow.md
@@ -0,0 +1,311 @@
# How to add MLflow to your Kedro workflow

[MLflow](https://mlflow.org/) is an open-source platform for managing the end-to-end machine learning lifecycle.
It provides tools for tracking experiments, packaging code into reproducible runs, and sharing and deploying models.
MLflow supports machine learning frameworks such as TensorFlow, PyTorch, and scikit-learn.

Adding MLflow to a Kedro project enables you to track and manage your machine learning experiments and models.
For example, you can log metrics, parameters, and artifacts from your Kedro pipeline runs to MLflow, then compare and reproduce the results. When collaborating with others on a Kedro project, MLflow's model registry and deployment tools help you to share and deploy machine learning models.


## Prerequisites

You will need the following:

- A working Kedro project in a virtual environment. The examples in this document assume the `spaceflights-pandas-viz` starter.
  If you're unfamiliar with the Spaceflights project, check out [our tutorial](/tutorial/spaceflights_tutorial).
- The MLflow client installed into the same virtual environment. For the purposes of this tutorial,
  you can use MLflow {external+mlflow:doc}`in its simplest configuration <tracking>`.

To set yourself up, create a new Kedro project:

```
$ kedro new --starter=spaceflights-pandas-viz --name spaceflights-mlflow
$ cd spaceflights-mlflow
$ python -m venv .venv && source .venv/bin/activate
(.venv) $ pip install -r requirements.txt
```

Then install MLflow and launch its UI locally from the root of your project directory:

```
(.venv) $ pip install mlflow
(.venv) $ mlflow ui --backend-store-uri ./mlflow_runs
```

> Review comment (Member Author): Why not leave the default location? Because it's called `mlruns`, which could be conflated with https://github.com/mlrun/mlrun 😬 This requires us to write a `mlflow.yml` in the first step, but I think it's not the end of the world.
>
> Reply (Contributor): It's actually better to show the `mlflow.yml` to the users :)

This will make MLflow record metadata and artifacts for each run
to a local directory called `mlflow_runs`.

:::{note}
If you want to use a more sophisticated setup,
have a look at the documentation of
[MLflow tracking server](https://mlflow.org/docs/latest/tracking/server.html),
{external+mlflow:doc}`the official MLflow tracking server 5 minute overview <getting-started/tracking-server-overview/index>`,
and {external+mlflow:ref}`the MLflow tracking server documentation <logging_to_a_tracking_server>`.
:::

## Simple use cases

Although MLflow works best when working with machine learning (ML) and AI pipelines,
you can track your regular Kedro runs as experiments in MLflow even if they do not use ML.

This section explains how you can use the [`kedro-mlflow`](https://kedro-mlflow.readthedocs.io/) plugin
to track your Kedro pipelines in MLflow in a straightforward way.

### Easy tracking of Kedro runs in MLflow using `kedro-mlflow`

To start using `kedro-mlflow`, install it first:

```
pip install kedro-mlflow
```

In recent versions of Kedro, this will already register the `kedro-mlflow` Hooks for you.

Next, create a `mlflow.yml` configuration file in your `conf/local` directory
that configures where the MLflow runs are stored,
consistent with how you launched the `mlflow ui` command:

```yaml
server:
  mlflow_tracking_uri: mlflow_runs
```

From this point, when you execute `kedro run` you will see the logs coming from `kedro-mlflow`:

```
[06/04/24 09:52:53] INFO Kedro project spaceflights-mlflow session.py:324
INFO Registering new custom resolver: 'km.random_name' mlflow_hook.py:65
INFO The 'tracking_uri' key in mlflow.yml is relative kedro_mlflow_config.py:260
('server.mlflow_(tracking|registry)_uri = mlflow_runs').
It is converted to a valid uri:
'file:///Users/juan_cano/Projects/QuantumBlackLabs/kedro-
mlflow-playground/spaceflights-mlflow/mlflow_runs'
```
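
The log above shows the relative `mlflow_runs` value being expanded to an absolute `file://` URI. As a rough illustration of that kind of conversion (this is not the actual `kedro-mlflow` implementation), the sketch below uses only the standard library. The helper name `to_tracking_uri` and the rule for detecting URIs that already have a scheme are assumptions made for this example.

```python
from pathlib import Path


def to_tracking_uri(value: str, project_root: Path) -> str:
    """Hypothetical helper: turn a relative path into a file:// URI."""
    # Assumption: anything that already has a scheme (http://, file://, ...)
    # is passed through untouched.
    if "://" in value:
        return value
    # Resolve relative paths against the project root, then format as a URI.
    return (project_root / value).resolve().as_uri()


print(to_tracking_uri("mlflow_runs", Path.cwd()))
print(to_tracking_uri("http://localhost:5000", Path.cwd()))
```

Resolving against the project root rather than the current working directory would keep the store location stable regardless of where the command is launched from.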

If you open your tracking server UI you will observe a result like this:

```{image} ../meta/images/complete-mlflow-tracking-kedro-mlflow.png
:alt: Complete MLflow tracking with kedro-mlflow
:width: 80%
:align: center
```

Notice that `kedro-mlflow` used a subset of the `run_params` as tags for the MLflow run,
and logged the Kedro parameters as MLflow parameters.

Check out {external+kedro-mlflow:doc}`the official kedro-mlflow tutorial <source/03_getting_started/02_first_steps>`
for more detailed steps.
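
MLflow parameters are flat key-value pairs, whereas Kedro parameters are often nested dictionaries. To give an idea of what logging nested parameters involves, here is a minimal sketch that flattens a nested dictionary into dotted keys. The helper `flatten_params` and the sample values are hypothetical; whether and how `kedro-mlflow` flattens parameters depends on its own configuration, which this sketch does not reproduce.

```python
from typing import Any


def flatten_params(params: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Hypothetical helper: flatten nested parameters into dotted keys."""
    flat: dict[str, Any] = {}
    for key, value in params.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, extending the key prefix.
            flat.update(flatten_params(value, full_key))
        else:
            flat[full_key] = value
    return flat


# Parameter values below are made up for illustration.
params = {"model_options": {"test_size": 0.2, "random_state": 3}}
flat = flatten_params(params)
print(flat)
```

A dictionary like `flat` could then be passed to `mlflow.log_params` inside an active run.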

### Artifact tracking in MLflow using `kedro-mlflow`

`kedro-mlflow` provides some out-of-the-box artifact tracking capabilities
that connect your Kedro project with your MLflow deployment, such as `MlflowArtifactDataset`,
which can be used to wrap any of your existing Kedro datasets.

Using this dataset lets you take advantage of the preview capabilities of the MLflow UI.

:::{warning}
This will work for datasets that are outputs of a node,
and will have no effect for datasets that are free inputs (hence are only loaded).
:::

For example, if you change a `matplotlib.MatplotlibWriter` dataset like this:

```diff
 # conf/base/catalog.yml

 dummy_confusion_matrix:
-  type: matplotlib.MatplotlibWriter
-  filepath: data/08_reporting/dummy_confusion_matrix.png
-  versioned: true
+  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
+  dataset:
+    type: matplotlib.MatplotlibWriter
+    filepath: data/08_reporting/dummy_confusion_matrix.png
```

Then the image would be logged as part of the artifacts of the run
and you would be able to preview it in the MLflow web UI:

```{image} ../meta/images/mlflow-artifact-preview-image.png
:alt: MLflow image preview thanks to the artifact tracking capabilities of kedro-mlflow
:width: 80%
:align: center
```

:::{warning}
If you get a `Failed while saving data to data set MlflowMatplotlibWriter` error,
it's probably because you had already executed `kedro run` while the dataset was marked as `versioned: true`.
The solution is to clean up the old `data/08_reporting/dummy_confusion_matrix.png` directory.
:::

Check out {external+kedro-mlflow:doc}`the official kedro-mlflow documentation on versioning Kedro datasets <source/04_experimentation_tracking/03_version_datasets>`
for more information.
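
Conceptually, wrapping a dataset for artifact tracking means "save the file locally as usual, then hand the saved file to the tracking backend". The sketch below illustrates that two-step pattern with the standard library only. `ArtifactLoggingWriter` and its arguments are hypothetical names for this example and do not mirror the internals of `MlflowArtifactDataset`.

```python
import tempfile
from pathlib import Path
from typing import Callable


class ArtifactLoggingWriter:
    """Hypothetical wrapper: save a file locally, then report it to a logger."""

    def __init__(
        self,
        filepath: Path,
        write: Callable[[bytes, Path], None],
        log_artifact: Callable[[str], None],
    ) -> None:
        self._filepath = filepath
        self._write = write
        self._log_artifact = log_artifact

    def save(self, data: bytes) -> None:
        # Step 1: persist the data locally, as the wrapped dataset would.
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        self._write(data, self._filepath)
        # Step 2: hand the local file over to the tracking backend.
        self._log_artifact(str(self._filepath))


logged: list[str] = []  # stand-in for the tracking backend
target = Path(tempfile.mkdtemp()) / "08_reporting" / "plot.png"
writer = ArtifactLoggingWriter(
    filepath=target,
    write=lambda data, path: path.write_bytes(data),
    log_artifact=logged.append,
)
writer.save(b"not a real image")
print(logged)
```

In a real project you could pass `mlflow.log_artifact` as the `log_artifact` callable inside an active MLflow run, although `MlflowArtifactDataset` already takes care of this for you.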

### Model registry in MLflow using `kedro-mlflow`

If your Kedro pipeline trains a machine learning model, you can track those models in MLflow
so that you can manage and deploy them.
The `kedro-mlflow` plugin introduces a special dataset, `MlflowModelTrackingDataset`,
that you can use to load and save your models as MLflow artifacts.

For example, if you have a dataset corresponding to a scikit-learn model,
you can modify it as follows:


```diff
 regressor:
-  type: pickle.PickleDataset
-  filepath: data/06_models/regressor.pickle
-  versioned: true
+  type: kedro_mlflow.io.models.MlflowModelTrackingDataset
+  flavor: mlflow.sklearn
```

The `kedro-mlflow` Hook will log the model as part of the run
in {external+mlflow:doc}`the standard MLflow Model format <models>`.

If you also want to _register_ it
(that is, store it in the MLflow Model Registry),
you can add a `registered_model_name` parameter:

```{code-block} yaml
:emphasize-lines: 4-5

regressor:
  type: kedro_mlflow.io.models.MlflowModelTrackingDataset
  flavor: mlflow.sklearn
  save_args:
    registered_model_name: spaceflights-regressor
```

Then you will see it listed as a Registered Model:

```{image} ../meta/images/kedro-mlflow-registered-model.png
:alt: MLflow Model Registry listing one model registered with kedro-mlflow
:width: 80%
:align: center
```

To load a model from a specific run, you can specify the `run_id`.
For that, you can make use of {ref}`runtime parameters <runtime-params>`:

```{code-block} yaml
:emphasize-lines: 13

# Add the intermediate datasets to run only the inference
X_test:
  type: pandas.ParquetDataset
  filepath: data/05_model_input/X_test.pq

y_test:
  type: pandas.CSVDataset  # https://github.com/pandas-dev/pandas/issues/54638
  filepath: data/05_model_input/y_test.csv

regressor:
  type: kedro_mlflow.io.models.MlflowModelTrackingDataset
  flavor: mlflow.sklearn
  run_id: ${runtime_params:mlflow_run_id,null}
  save_args:
    registered_model_name: spaceflights-regressor
```

Then specify the MLflow run ID on the command line as follows:

```
$ kedro run --to-outputs=X_test,y_test
...
$ kedro run --from-nodes=evaluate_model_node --params mlflow_run_id=4cba84...
```

> Review comment on lines +215 to +217 (Member Author): This was somewhat clunky, dumped some thoughts in #3922
>
> Reply (Contributor): I agree, but it's great that you added that section! It's an interesting functionality.

:::{note}
MLflow runs are immutable for reproducibility purposes,
so you cannot _save_ a model in an existing run.
:::

## Advanced use cases

### Track additional metadata of Kedro runs in MLflow using Hooks

So far, `kedro-mlflow` has covered tracking of runs, artifacts, and models with little configuration.
Still, you might need to track additional metadata in the run.

One possible way of doing it is using the {py:meth}`~kedro.framework.hooks.specs.PipelineSpecs.before_pipeline_run` Hook
to log the `run_params` passed to the Hook.
An implementation would look as follows:

```python
# src/spaceflights_mlflow/hooks.py

import typing as t
import logging

import mlflow
from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)


class ExtraMLflowHooks:
    @hook_impl
    def before_pipeline_run(self, run_params: dict[str, t.Any]):
        logger.info("Logging extra metadata to MLflow")
        mlflow.set_tags({
            "pipeline": run_params["pipeline_name"] or "__default__",
            "custom_version": "0.1.0",
        })
```

Then enable your custom Hook in `settings.py`:

```python
# src/spaceflights_mlflow/settings.py
...
from .hooks import ExtraMLflowHooks

HOOKS = (ExtraMLflowHooks(),)
...
```

After enabling this custom Hook, you can execute `kedro run`, and see something like this in the logs:

```
...
[06/04/24 10:44:25] INFO Logging extra metadata to MLflow hooks.py:13
...
```

If you open your tracking server UI you will observe a result like this:

```{image} ../meta/images/extra-mlflow-tracking.png
:alt: Simple MLflow tracking
:width: 50%
:align: center
```
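
If the set of tags grows, it can help to build them in a plain function that is easy to test without Kedro or MLflow. The following is a minimal sketch under stated assumptions: `build_extra_tags` is a hypothetical helper, and reading the `env` key from `run_params` is an illustrative choice rather than something the Hook above requires.

```python
from typing import Any


def build_extra_tags(run_params: dict[str, Any], custom_version: str) -> dict[str, str]:
    """Hypothetical helper: derive MLflow tags from Kedro's run_params."""
    return {
        # pipeline_name may be empty, so fall back to "__default__"
        # in the same way as the Hook above.
        "pipeline": run_params.get("pipeline_name") or "__default__",
        # Assumption: "env" may be missing or None when no environment is set.
        "env": str(run_params.get("env") or "unset"),
        "custom_version": custom_version,
    }


tags = build_extra_tags({"pipeline_name": None, "env": "local"}, "0.1.0")
print(tags)
```

Inside `before_pipeline_run`, the Hook could then call `mlflow.set_tags(build_extra_tags(run_params, "0.1.0"))`.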

### Tracking Kedro in MLflow using the Python API


If you are running Kedro programmatically using the Python API,
you can log your runs using the MLflow "fluent" API.

For example, taking the {doc}`lifecycle management example </kedro_project_setup/session>`
as a starting point:

```python
from pathlib import Path

import mlflow
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())

mlflow.set_experiment("Kedro Spaceflights test")

with KedroSession.create() as session:
    with mlflow.start_run():
        mlflow.set_tag("session_id", session.session_id)
        session.run()
```

If you want more flexibility or need to log extra parameters,
you might need to run the Kedro pipelines manually.