
Setup mlflow before KedroContext #292

Closed
stephanecollot opened this issue Mar 10, 2022 · 5 comments

@stephanecollot

Hello,

I have a custom KedroContext (where I initialise a Spark session) and I would like to log things to mlflow at that moment.
But if I log things there, the parameters from my mlflow.yml are not taken into account.

I tried calling the following in my KedroContext:

        mlflow_config = get_mlflow_config()
        mlflow_config.setup()

But I got the following error:
RuntimeError: There is no active Kedro session

@stephanecollot
Author

If you have any idea for a workaround, feel free to share it!

@Galileo-Galilei
Owner

Hi @stephanecollot,

Sorry for the delayed response. This is actually a tricky question and I don't have a clear-cut solution, but I can guide you towards a workaround that may suit your use case.

A bit of history

In earlier kedro versions (e.g. 0.15.X), the KedroContext was the only place where you could add custom code to interact with kedro during execution (e.g. when you launch kedro run). This was not convenient because:

  • it created cluttered and hard-to-maintain custom ProjectContext classes
  • different pieces of logic were hard to compose: if I created a MlflowContext and you a SparkContext, one would have to inherit from the other to combine the two behaviours. With more than two pieces of custom logic, this became completely intractable in practice.

With kedro==0.16.X, kedro introduced hooks. Hooks make it easy to compose and distribute different pieces of custom code; in return, we lost the ability to inject code anywhere during the run: custom code can only be attached at predefined places (especially before / after pipeline and node execution).

In general, it is no longer recommended to add custom logic inside the context, BUT the kedro documentation still presents the context as the recommended place to initialise a spark session. I think this is no longer justified: the reason they do it there instead of in a hook is that they need to access a spark.yml config file, which is hard to retrieve inside hooks, even though there are solutions for this from kedro>=0.17.X.

On execution order

When launching kedro run, the KedroSession is instantiated first, and during its instantiation the ProjectContext is instantiated. This explains why you will never be able to retrieve the configuration (and "setup" mlflow) from inside the context: the session simply does not exist at that moment, and is obviously not active yet.
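
Schematically, here is a rough sketch of that order using kedro 0.17's public API (my_project is a placeholder package name):

# Rough sketch of what "kedro run" does under the hood (kedro 0.17.X)
from kedro.framework.session import KedroSession

with KedroSession.create(package_name="my_project") as session:
    # The ProjectContext is instantiated while the session is being set up,
    # so code in ProjectContext.__init__ runs before the session is active:
    # calling get_mlflow_config().setup() there raises
    # "RuntimeError: There is no active Kedro session".
    session.run()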

Potential solutions

Solution 1: Keep your custom context and log in mlflow in a hook

Create a custom hook:

from typing import Any, Dict

import mlflow
from kedro.framework.hooks import hook_impl


class MyMlflowHook:
    """Namespace for grouping all model-tracking hooks with MLflow together."""

    @hook_impl
    def before_pipeline_run(self, run_params: Dict[str, Any]) -> None:
        """Hook implementation to start an MLflow run
        with the same run_id as the Kedro pipeline run.
        """
        mlflow.log_xxx("<whatever>")  # placeholder: any mlflow.log_param / log_metric / set_tag call

You can register it in your project's settings.py, for instance:
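
A minimal sketch (assuming the hook class lives in a hypothetical my_project/hooks.py):

# settings.py
from my_project.hooks import MyMlflowHook  # adjust the import path to your project

HOOKS = (MyMlflowHook(),)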

This will be triggered after the ProjectContext initialisation, but it may feel uncomfortable to navigate between the context and the hook to log what you need.

Solution 2: Move everything inside a hook

from typing import Any, Dict

import mlflow
from kedro.framework.hooks import hook_impl
from kedro.framework.session import get_current_session
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkMlflowHook:
    """Namespace for grouping the spark setup and MLflow tracking hooks together."""

    @hook_impl
    def before_pipeline_run(self, run_params: Dict[str, Any]) -> None:
        """Hook implementation to initialise spark and then log to the MLflow run
        with the same run_id as the Kedro pipeline run.
        """
        # FIRST, get the session and recreate the context
        session = get_current_session()
        context = session.load_context()
        config_loader = context.config_loader  # a property, not a method

        # SECOND, initialise the spark session
        # Load the spark configuration in spark.yaml using the config loader
        parameters = config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the spark session (the app name is derived from the project
        # folder here; adapt it to your package name)
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")

        # THIRD, log to mlflow (no need to call "start_run" since the kedro-mlflow
        # hook has already been executed just before this one)
        mlflow.log_xxx("xxx")  # placeholder: any mlflow.log_param / log_metric / set_tag call

This implies recreating the context, but can you tell me if it suits your needs?

@stephanecollot
Author

Thanks for this very detailed and interesting answer.

To be more specific, I would like to log the spark configuration and the spark application id in MLflow.

I'm going to try your solution 2.
I'm using kedro==0.17.4
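
For reference, a minimal sketch of that logging step (assuming the SparkSession and the MLflow run set up in solution 2 are already active):

import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses the session created in the hook

# Log the spark application id as a tag and the explicitly set
# spark configuration entries as params
mlflow.set_tag("spark.applicationId", spark.sparkContext.applicationId)
mlflow.log_params(dict(spark.sparkContext.getConf().getAll()))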

@Galileo-Galilei
Owner

Hi @stephanecollot, did you manage to make it work? As stated above, this is not really a bug and there is nothing for me to fix, so I'll close this issue, but I can still help you achieve what you want.

@stephanecollot
Author

Hi,

Thanks a lot, it works like a charm!

Cheers
