
is it recommended to pass intermediate ML object between nodes? #505

Closed
miyamonz opened this issue Sep 8, 2020 · 5 comments

@miyamonz

miyamonz commented Sep 8, 2020

What are you trying to do?

I'm trying to use Kedro as an ML framework.

For example, pytorch-ignite, pytorch-lightning, and so on are well known,
and I want to use Kedro for such a purpose.

like this:

[Pipeline diagram (image hosted on Gyazo)]
This pipeline fine-tunes a pretrained model, and you can see that it passes optimizer objects as a dataset.
The get optimizers node receives the pretrained model object and produces an optimizers dataset that contains an optimizer and a scheduler.
That is, I'm passing PyTorch objects as Kedro DataSets.
The "intermediate" in the title refers to such objects.

But I can't find such use cases by searching this GitHub repo or the Internet.
It seems that Kedro pipelines and nodes basically handle data that can be converted to pd.DataFrame or CSV, or the final ML model to be saved.

So I want to know whether Kedro's contributors consider this use case good or bad.
If someone already does this or knows of this use case, please let me know.

@921kiyo
Contributor

921kiyo commented Sep 8, 2020

Hi @miyamonz
Thank you for using Kedro, and welcome to our community!

Kedro provides a number of built-in datasets, and we (and contributors) keep adding new datasets for handling other data formats, including TensorFlow models (see https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.tensorflow.TensorFlowModelDataset.html). But we don't have any built-in dataset for PyTorch models yet.

What you can do is add a custom dataset (see how to implement one in https://kedro.readthedocs.io/en/stable/07_extend_kedro/01_custom_datasets.html#custom-datasets), similar to TensorFlowModelDataset (you can find its source code in https://github.com/quantumblacklabs/kedro/blob/master/kedro/extras/datasets/tensorflow/tensorflow_model_dataset.py).

Or, if your PyTorch objects are picklable, you could use pickle.PickleDataSet (no need to create a custom dataset in that case).
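A minimal, framework-agnostic sketch of what pickle-based persistence does. A plain dict stands in for an optimizer's state here so the snippet stays self-contained; real PyTorch optimizers can be serialized the same way as long as their contents are picklable.

```python
import os
import pickle
import tempfile

# Toy stand-in for an optimizer's state.
optimizer_state = {"lr": 0.01, "momentum": 0.9, "step": 42}

path = os.path.join(tempfile.mkdtemp(), "optimizer.pkl")

# Save: essentially what a pickle-based dataset does on write.
with open(path, "wb") as f:
    pickle.dump(optimizer_state, f)

# Load: the round-tripped object is equal to, but not identical with, the original.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == optimizer_state)  # True
print(restored is optimizer_state)  # False
```

Note the identity check: a loaded object is always a fresh copy, which matters if the original held references to other live objects (as discussed below in this thread).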

And your contribution of a new dataset to Kedro is more than welcome :)

Hope this helps. Please let me know if you have any questions.

@Minyus
Contributor

Minyus commented Sep 8, 2020

Kedro's pipeline does not support cyclic dependency, so it might be tough to use for a repetitive process such as multiple epochs of training neural network models.
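A hedged sketch of the usual workaround for this DAG constraint: keep the whole epoch loop inside a single node, so no cycle between nodes is needed. The node function name and the loss computation below are hypothetical stand-ins, not Kedro or PyTorch APIs.

```python
# Since a Kedro pipeline is a DAG, the epoch loop cannot be expressed as a
# cycle between nodes; instead, the repetition lives inside one node function.
def train_model_node(model_params, n_epochs=3):
    history = []
    for epoch in range(n_epochs):   # repetition happens inside the node
        loss = 1.0 / (epoch + 1)    # stand-in for one epoch of real training
        history.append(loss)
    # Outputs flow to downstream nodes exactly once, after the loop finishes.
    return model_params, history

params, losses = train_model_node({"w": 0.0})
print(len(losses))  # 3
```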

To use PyTorch Ignite with Kedro, I developed a wrapper (declarative high-level API) of PyTorch Ignite and open-sourced as part of my PipelineX package:
https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/extras/ops/ignite/declaratives/declarative_trainer.py

Here is an example project to use the wrapper of PyTorch Ignite:
https://github.com/Minyus/pipelinex_pytorch

PyTorch Lightning provides a high-level API called "Trainer", so this could be used with Kedro.
https://pytorch-lightning.readthedocs.io/en/latest/trainer.html

@miyamonz
Author

miyamonz commented Sep 9, 2020

Thanks for your answers!
It helps to know that my approach is not wrong.

That said, I found a point that is difficult for a beginner facing a similar situation, and I want to let you know about it.

When I pass intermediate objects such as an optimizer between nodes without any config, they are stored in the default MemoryDataSet with deep-copy mode.
But that breaks things: the optimizer holds references to the pretrained model's parameters, and MemoryDataSet copies the optimizer object, severing those references.

There is no wrong method call or type error between nodes, because MemoryDataSet just copied the object.
So training runs with no error, but of course the accuracy doesn't improve.

When I first saw the name MemoryDataSet, I assumed it worked like assign mode, so it took me a long time to find my mistake.
Writing the config like this made my pipeline work:

optimizers:
  type: MemoryDataSet
  copy_mode: assign
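The difference between deep-copy mode and assign mode can be illustrated with plain Python objects (a sketch: the list stands in for model parameters, and the dict for an optimizer that keeps a reference to them).

```python
import copy

params = [1.0, 2.0]            # stand-in for model parameters
optimizer = {"params": params}  # optimizer keeps a reference to them

# Default MemoryDataSet behavior (deep copy): the reference is severed.
copied = copy.deepcopy(optimizer)
copied["params"][0] = 99.0
print(params[0])  # 1.0 -> the copy no longer points at the real parameters

# copy_mode: assign -> the very same object, reference preserved.
assigned = optimizer
assigned["params"][0] = 99.0
print(params[0])  # 99.0 -> updates reach the real parameters
```

This is exactly why training runs without errors in deep-copy mode but never improves: the optimizer silently updates a detached copy of the parameters.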

So I think this behavior should be documented.


I've realized since writing this that it is already documented here.
I don't use Spark, so I couldn't find it. 😂

@miyamonz
Author

miyamonz commented Sep 9, 2020

@Minyus

Kedro's pipeline does not support cyclic dependency, so it might be tough to use for a repetitive process such as multiple epochs of training neural network models.

Yes, that's important.
Fortunately, I originally assumed epoch loops have to live inside a node, so I didn't hit that problem.

I'll take a look at PipelineX. Thanks!

@921kiyo
Contributor

921kiyo commented Sep 9, 2020

I believe the original question has been solved. I'm closing this, but feel free to reopen it or open a new issue (or post a question on StackOverflow: https://stackoverflow.com/questions/tagged/kedro) if you have any follow-up questions :)

@921kiyo 921kiyo closed this as completed Sep 9, 2020