Use mlflow for better versioning and collaboration #113
Comments
Hi @Galileo-Galilei! We're so glad to hear that you've found Kedro useful and I think it's fantastic that you're building on top of it. Let me see if I can address some of the thoughts that you've raised:
Hello @yetudada, many thanks for the reply. I was quite busy at work recently, but I will definitely try to make a kedro-mlflow plugin by the end of the year. Some comments on the different points you addressed:
but when passing a decorator I can only access the … and I do not see how I can log the inputs without access to `run_node` (but I am open to any less hacky solution).
Interesting! I really agree with you guys that MLflow is a natural extension of Kedro. At MFlux.ai, we have made a tutorial that shows a simple example of how to combine Kedro and MLflow in one project: https://www.mflux.ai/tutorials/ml-pipeline/
Hi @Galileo-Galilei! I hope that you're well. Let me work my way through your comments.
2a. This makes sense. Let us know if you need help with …
@iver56 This is great! What was your experience like using both tools?
@yetudada I'm glad you liked it! Adding mlflow to a kedro project felt like just adding some mlflow function calls here and there. I feel like mlflow fits in most places without a need to rewrite whatever code is already lying around. Kedro is more opinionated: it's more like a starting template for data science projects. If I want to start using Kedro in an existing code base, it has significant implications; I feel like I have to rewrite/refactor code and do many things the Kedro way, which can sometimes feel like a hindrance. But on the other hand, doing things the Kedro way gives the project a common structure that looks familiar to others who know Kedro. That is an obvious benefit in medium-sized to large companies that use Kedro and have data engineers and data scientists who come and go.
Hello @yetudada, some news about our progress:
@iver56 You're definitely right: the first reason that made us lean towards kedro is that it makes collaboration much easier (you can actually show your pipeline with …
@iver56 That's a really great perspective. You've hit on one of the reasons why Kedro exists: creating maintainable code bases when teams change in large organisations. What changes did you have to make to your workflow to work the Kedro way?
@Galileo-Galilei, you're such a rockstar 🚀 Well done on deploying your Kedro pipelines and getting everyone up to speed on Kedro! It makes us so happy to read this!
@Galileo-Galilei I should let you know that @limdauto is working on a way to extend Kedro using hooks as part of #219 and has indicated that it's extremely easy to create the MLflow plugin with this system. Are you still using your customisations?
Hello @yetudada, sorry for not coming back here for a while, I was quite busy at work. Some news and feedback:
Pros for keeping most of the logic in `ProjectContext`:
a. It enables handling very specific situations at the project level, which are not intended to be generic.

Cons for keeping most of the logic in `ProjectContext`:
a. Currently, I extend the context by creating inheritance from … Everything is fine. Imagine now that I also have some Spark logic: I cannot (easily) inherit from both … Conclusion: I have never used …
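The multiple-inheritance problem described above can be sketched in plain Python. The class names below are hypothetical stand-ins for kedro's `ProjectContext` and plugin-provided subclasses; the point is that combining two independently written context subclasses only works if every one of them cooperates via `super()`, which third-party plugins cannot be relied upon to do:

```python
# Hypothetical sketch: "BaseContext" stands in for kedro's ProjectContext;
# the mixin names are made up for illustration.

class BaseContext:
    def after_catalog_created(self):
        return ["base"]

class MlflowContext(BaseContext):
    def after_catalog_created(self):
        # cooperative: extends whatever the parent did
        return super().after_catalog_created() + ["mlflow"]

class SparkContext(BaseContext):
    def after_catalog_created(self):
        return super().after_catalog_created() + ["spark"]

# Combining both customisations requires writing a third class by hand.
# It happens to work here because BOTH mixins call super(); a plugin
# context that simply overrides the method would silently drop the
# other plugin's behaviour.
class ProjectContext(MlflowContext, SparkContext):
    pass

print(ProjectContext().after_catalog_created())  # ['base', 'spark', 'mlflow']
```

This fragility is exactly what a hook/plugin mechanism (as in #219) avoids: each plugin registers its behaviour independently instead of competing for one inheritance chain.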
Hey @Galileo-Galilei! I have so many cool things to tell you!
We actually do have an MLflow example ready for you to try:
Let us know if you want a crash-course demo, and feel like spending time with the team.
And one more thing @Galileo-Galilei, for 4b. Could you provide an example of what you're trying to do? We're working on the modification part of Framework Redesign indicated in #219 and it would be great to understand if this problem fits there. |
Hello @yetudada, First, I have some very good news: I released a first version of kedro-mlflow. It will make our discussions more efficient, as I can show you the code. For now, the package is poorly documented / tested and lacks some functionality, but I will update it in the following months. Note that it is based on kedro's develop branch and uses … Regarding your questions:
Regarding 4.b, this was more a general thought on how the design should (IMHO) separate the template from the framework. Basically, I think that some information should move from the `ProjectContext` to the `.kedro.yml` file, because you may want to access it without loading the context. I'll write a detailed answer one day (likely in a new issue), but I have no time right now and it needs to be thought through carefully (I don't yet have all the implications of the changes I would suggest in mind).
Given the …
Yes, sure. Opening a PR to add the plugin to the list of community-developed plugins lies somewhere on my todo list; I'll try to do it in the near future! PS: This is not directly linked to this issue, but the last comment about moving all the information (project name, kedro version, place to register the configloader / the pipelines...) from the context to either the …
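For context on what would move out of the `ProjectContext`, the `.kedro.yml` file being discussed looked roughly like this in the kedro 0.16 era. The field names below are recalled from that era's project template and should be treated as an assumption, not a spec:

```
# .kedro.yml (kedro ~0.16) -- approximate, field names are an assumption
context_path: my_project.run.ProjectContext
project_name: my-project
project_version: 0.16.1
package_name: my_project
```

The argument above is that static metadata like this is cheaper to read from a declarative file than by importing and instantiating the context class.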
We've seen the continued development on the kedro-mlflow plugin. If you need a slimmed-down alternative, check out how to integrate MLflow using Hooks in the Kedro documentation.
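The Hooks approach mentioned above can be sketched as follows. To keep the example self-contained and runnable without mlflow or kedro installed, `mlflow` is replaced by a tiny stand-in recorder; in a real plugin the method would be decorated with kedro's `@hook_impl`, registered with the project, and would call the real `mlflow.log_params`. The hook signature is simplified here and should be checked against the kedro hook specs:

```python
# Rough sketch of an MLflow node hook, with a stand-in for the mlflow
# module so the example runs anywhere. Real code would `import mlflow`
# and decorate the method with kedro's @hook_impl.

class FakeMlflow:
    """Stand-in for mlflow: just records what was logged."""
    def __init__(self):
        self.params = {}

    def log_params(self, params):
        self.params.update(params)

mlflow = FakeMlflow()  # real code: import mlflow

class MlflowNodeHook:
    # Simplified signature; kedro's before_node_run spec also receives
    # the node object, the catalog, and more.
    def before_node_run(self, node_name, inputs):
        # Log every input whose name marks it as a kedro parameter.
        params = {k: v for k, v in inputs.items() if k.startswith("params:")}
        if params:
            mlflow.log_params(params)

hook = MlflowNodeHook()
hook.before_node_run("train_model", {"params:learning_rate": 0.1, "train_x": [1, 2]})
print(mlflow.params)  # {'params:learning_rate': 0.1}
```

The key design point is that the hook lives outside the project's `ProjectContext`, so no inheritance gymnastics are needed to combine it with other plugins.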
TL;DR: The plugin is in active development here and is available on PyPI. It already works reliably with `kedro>=0.16.0`, but is slightly different from (and much more complete than) what is described in the issue below. Feel free to try it out and give feedback. The plugin enforces Kedro design principles when integrating mlflow (strict separation of I/O vs compute, external configuration, data abstraction, cli wrappers...) to avoid breaking the Kedro experience when using mlflow, and to facilitate versioning and model serving.

A huge thanks for the framework, which is really useful. My team decided to use it for most of its projects, especially to ensure collaboration. Data abstraction is a really important feature. However, we have a major disagreement about how data versioning is implemented in kedro. We decided to move on and develop our own versioning layer on top of your framework.
I'd be glad to discuss some of its architecture / design choices with the kedro developers, and that is the goal of this issue.
Context
Versioning in machine learning is something very specific: you want to version a run, i.e. the execution of code on data with parameters. Versioning data alone is likely to be useless for future reproducibility.
Databricks recently released mlflow, which is intended to address this very goal. I think it would be beneficial for kedro to build on top of what mlflow has already created, in order to …
Description
The current internal versioning method in kedro does not intend to version a full "run" (code + data + parameters), which makes it less useful for machine learning. Switching to mlflow for this would be a quick win to improve the framework.
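The notion of versioning a full run (code + data + parameters) can be made concrete with a small sketch: hash all three together, so that changing any one of them yields a new version. This is purely illustrative; it is not how mlflow or kedro actually compute run identifiers:

```python
import hashlib
import json

def run_fingerprint(code_sha: str, data_hash: str, params: dict) -> str:
    """Illustrative run id: code + data + parameters hashed together.

    Any change to the code commit, the input data, or the parameters
    produces a different fingerprint, which is the property the issue
    argues plain data versioning lacks.
    """
    payload = json.dumps(
        {"code": code_sha, "data": data_hash, "params": params},
        sort_keys=True,  # stable serialization so equal runs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

a = run_fingerprint("abc123", "d41d8cd9", {"lr": 0.1})
b = run_fingerprint("abc123", "d41d8cd9", {"lr": 0.2})
print(a != b)  # True: changing only a parameter changes the version
```

Versioning only the data (`data_hash` here) would leave two runs with different parameters indistinguishable, which is the reproducibility gap described above.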
Possible Implementation
My team has implemented several features:

- An `mlflow.yml` file in the conf/base folder, which enables parameterizing all mlflow features through a conf file (autologging parameters, tracking uri, experiment where the run should be stored...) and which is added to the template. This is really useful since we use a "local" mlflow server where each data scientist can experiment, and a shared one with shareable models and runs, and it is nice to parameterize this through a config file.
- An `MlflowDataset` class (similar to the `AbstractVersionedDataset` class), which enables deciding whether a dataset should be logged as an mlflow artifact (i.e. the `versioned` parameter in `catalog.yml` is replaced by a `use_mlflow: true` that you can pass to any dataset). This automatically logs the dataset as an mlflow artifact. As a best practice, we consider that we should version only datasets that are fitted on data (e.g. encoder, binarizer, machine learning models...).
- When `run_node` is called, the parameters that are used in the node are logged as mlflow parameters (through `mlflow.log_params`). This is customizable in the `mlflow.yml` conf file.
- A `kedro pull --run-id MLFLOW_RUN_ID` command that enables getting data from an mlflow run and copying it into your `data` folder. This is really convenient to share a run with coworkers (especially since we can also retrieve the commit sha from mlflow to get the exact same code). This `pull` command also pulls parameters and writes them to an `mlflow_parameters.yml` file. It warns you about conflicts (parameters which exist both in your local conf and in the mlflow run you've just pulled) and lets you select by hand which one you want to keep. (To make `kedro pull` work, we also decided to log some configuration files as artifacts, including the catalog and the parameters, when using `kedro run`
, but this is purely technical.)

General thoughts about the feature
I would love to hear the kedro developers' thoughts:
I can understand that the developers want kedro to be "self contained" and not rely on a third-party application. However, I think it is definitely not a good idea to reinvent the wheel. Besides, such a change would not be harmful for kedro users: …
I think it is a good way to get the "best of the two worlds" (mlflow offers configuration through an `MLProject` file which is overlapping with and less flexible than kedro's AFAIK, so I'd rather stick to kedro for this).
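The parameter-conflict warning described for the `kedro pull` command above (parameters that exist both in the local conf and in the pulled mlflow run) boils down to a dict comparison. A minimal sketch, with hypothetical names:

```python
def find_param_conflicts(local: dict, pulled: dict) -> dict:
    """Return {key: (local_value, pulled_value)} for every parameter
    that exists in both configurations with different values.

    Keys present on only one side are not conflicts and can be merged
    silently; the user only has to arbitrate the returned entries.
    """
    return {
        key: (local[key], pulled[key])
        for key in local.keys() & pulled.keys()  # keys present on both sides
        if local[key] != pulled[key]
    }

local_conf = {"lr": 0.1, "epochs": 10}
pulled_conf = {"lr": 0.05, "epochs": 10, "seed": 42}
print(find_param_conflicts(local_conf, pulled_conf))  # {'lr': (0.1, 0.05)}
```

A real `pull` command would then prompt the user per conflicting key before writing the merged result to `mlflow_parameters.yml`.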