
Evaluate logging input data #446

Open
Galileo-Galilei opened this issue Aug 18, 2023 · 1 comment
Labels
enhancement New feature or request need-design-decision Several ways of implementation are possible and one must be chosen

Comments

@Galileo-Galilei
Owner

Description

MLflow introduced an API to log input datasets (`mlflow.log_input` and the `mlflow.data` module). I should evaluate the opportunity to integrate it.

Context

Logging input data is useful for reproducibility.

Possible implementation

Create a MLflowDatasetDataset?

@lvijnck

lvijnck commented Aug 28, 2024

Hi @Galileo-Galilei

```yaml
# catalog
integration.prm.unified_edges:
  <<: *_layer_prm
  type: matrix.datasets.mlflow.MlFlowInputDataDataSet
  name: edges
  context: integration
  dataset:
    <<: *_spark_parquet
    filepath: ${globals:paths.prm}/unified/edges
```
"""Custom Mlflow datasets."""
import mlflow

import pandas as pd
from copy import deepcopy
from typing import Any, Dict, Union

from mlflow.tracking import MlflowClient

from kedro_mlflow.io.metrics.mlflow_abstract_metric_dataset import (
    MlflowAbstractMetricDataset,
)

from kedro_datasets.pandas import ParquetDataset, CSVDataset
from kedro_datasets.spark import SparkDataset
from kedro_datasets.spark.spark_dataset import _strip_dbfs_prefix

from kedro.io.core import PROTOCOL_DELIMITER, AbstractDataset

from refit.v1.core.inject import _parse_for_objects


class MlFlowInputDataDataSet(AbstractDataset):
    """Kedro dataset to represent MLFlow Input Dataset."""

    def __init__(
        self,
        *,
        name: str,
        context: str,
        dataset: AbstractDataset,
        metadata: dict[str, Any] | None = None,
    ):
        """Initialise MlflowMetricDataset.

        Args:
            name (str): name of dataset in MLFlow
            context: context where dataset is used
            dataset: Underlying Kedro dataset
            metadata: kedro metadata
        """
        self._name = name
        self._context = context
        self._dataset = _parse_for_objects(dataset)

    def _load(self) -> Any:
        return self._dataset._load()

    def _save(self, data):
        self._dataset.save(data)

        # FUTURE: Support other datasets
        # FUTURE: Fix the source of data
        # https://github.com/mlflow/mlflow/issues/13015
        if any(isinstance(self._dataset, ds) for ds in [ParquetDataset, CSVDataset]):
            ds = mlflow.data.from_pandas(
                data, name=self._name, source=self._get_full_path(self._dataset)
            )
        elif isinstance(self._dataset, SparkDataset):
            ds = mlflow.data.from_spark(
                data,
                name=self._name,
                path=_strip_dbfs_prefix(
                    self._dataset._fs_prefix + str(self._dataset._get_load_path())
                ),
            )
        else:
            raise NotImplementedError(
                f"MLFlow Logging for dataset of type {type(self._dataset)} not implemented!"
            )

        mlflow.log_input(ds, context=self._context)

    @staticmethod
    def _get_full_path(dataset: AbstractDataset):
        return f"{dataset._protocol}://{str(dataset._filepath)}"

    def _describe(self) -> Dict[str, Any]:
        """Describe MLflow metrics dataset.

        Returns:
            Dict[str, Any]: Dictionary with MLflow metrics dataset description.
        """
        return {"context": self._context, "name": self._name}

We're currently using the implementation above in our pipeline.
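One possible refinement: the `isinstance` chain in `_save` could be replaced with `functools.singledispatch`, so supporting a new dataset type (the first `FUTURE` item above) becomes a single registered function rather than another `elif` branch. A toy sketch with hypothetical stand-in classes, not the real kedro/mlflow types:

```python
from functools import singledispatch

# Hypothetical stand-ins for illustration only; the real wrapper would
# register kedro_datasets.pandas.ParquetDataset, CSVDataset, etc.
class ParquetDataset: ...
class CSVDataset: ...
class SparkDataset: ...

@singledispatch
def to_mlflow_dataset(dataset, data, name):
    """Convert saved data to an MLflow dataset object, dispatching on the
    type of the underlying Kedro dataset."""
    raise NotImplementedError(
        f"No MLflow conversion for {type(dataset).__name__}"
    )

@to_mlflow_dataset.register(ParquetDataset)
@to_mlflow_dataset.register(CSVDataset)
def _(dataset, data, name):
    # Real code would return mlflow.data.from_pandas(data, name=name, ...)
    return ("from_pandas", name)

@to_mlflow_dataset.register(SparkDataset)
def _(dataset, data, name):
    # Real code would return mlflow.data.from_spark(data, name=name, ...)
    return ("from_spark", name)

print(to_mlflow_dataset(CSVDataset(), None, "edges"))    # ('from_pandas', 'edges')
print(to_mlflow_dataset(SparkDataset(), None, "edges"))  # ('from_spark', 'edges')
```

`_save` would then shrink to `ds = to_mlflow_dataset(self._dataset, data, self._name)` followed by `mlflow.log_input(ds, context=self._context)`.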

Projects
Status: 📋 Backlog