
is it recommended to pass intermediate ML object between nodes? #505

Closed
miyamonz opened this issue Sep 8, 2020 · 5 comments

@miyamonz

miyamonz commented Sep 8, 2020

What are you trying to do?

I'm trying to use Kedro as an ML framework.

For example, pytorch-ignite, pytorch-lightning, and so on are well known,
and I want to use Kedro for such a purpose.

like this:

[Pipeline diagram (image hosted on Gyazo)]
This pipeline fine-tunes a pretrained model, and you can see that it passes optimizer objects as a dataset.
The get optimizers node receives the pretrained model object and produces an optimizers dataset that contains an optimizer and a scheduler.
That is, I'm passing PyTorch objects as Kedro DataSets.
The "intermediate" in the title refers to such objects.

But I can't find such use cases by searching this GitHub repo or the Internet.
It seems that Kedro pipelines and nodes basically handle data that can be converted to pd.DataFrame or CSV, or the final ML model to be saved.

So I want to know whether Kedro's contributors consider this use case good or bad.
If someone already does this or knows of this use case, please let me know.

@921kiyo
Contributor

921kiyo commented Sep 8, 2020

Hi @miyamonz
Thank you for using Kedro, and welcome to our community!

Kedro provides a number of built-in datasets, and we (and contributors) keep adding new datasets for handling other data formats, including TensorFlow models (see https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.tensorflow.TensorFlowModelDataset.html). But we don't have any built-in dataset for PyTorch models yet.

What you can do is add a custom dataset (see how to implement one in https://kedro.readthedocs.io/en/stable/07_extend_kedro/01_custom_datasets.html#custom-datasets), similar to TensorFlowModelDataset (you can find its source code in https://github.com/quantumblacklabs/kedro/blob/master/kedro/extras/datasets/tensorflow/tensorflow_model_dataset.py).

Or, if your PyTorch objects are picklable, you could use pickle.PickleDataSet (no need to create a custom dataset in that case).
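A minimal, framework-agnostic sketch of what pickle-based persistence does. A plain dict stands in for an optimizer's state here so the snippet stays self-contained; real PyTorch optimizers can be serialized the same way as long as their contents are picklable.

```python
import os
import pickle
import tempfile

# Toy stand-in for an optimizer's state.
optimizer_state = {"lr": 0.01, "momentum": 0.9, "step": 42}

path = os.path.join(tempfile.mkdtemp(), "optimizer.pkl")

# Save: essentially what a pickle-based dataset does on write.
with open(path, "wb") as f:
    pickle.dump(optimizer_state, f)

# Load: the round-tripped object is equal to, but not identical with, the original.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == optimizer_state)  # True
print(restored is optimizer_state)  # False
```

Note the identity check: a loaded object is always a fresh copy, which matters if the original held references to other live objects (as discussed below in this thread).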

And your contribution of a new dataset to Kedro is more than welcome :)

Hope this helps. Please let me know if you have any questions.

@Minyus
Contributor

Minyus commented Sep 8, 2020

Kedro's pipeline does not support cyclic dependency, so it might be tough to use for a repetitive process such as multiple epochs of training neural network models.
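A hedged sketch of the usual workaround for this DAG constraint: keep the whole epoch loop inside a single node, so no cycle between nodes is needed. The node function name and the loss computation below are hypothetical stand-ins, not Kedro or PyTorch APIs.

```python
# Since a Kedro pipeline is a DAG, the epoch loop cannot be expressed as a
# cycle between nodes; instead, the repetition lives inside one node function.
def train_model_node(model_params, n_epochs=3):
    history = []
    for epoch in range(n_epochs):   # repetition happens inside the node
        loss = 1.0 / (epoch + 1)    # stand-in for one epoch of real training
        history.append(loss)
    # Outputs flow to downstream nodes exactly once, after the loop finishes.
    return model_params, history

params, losses = train_model_node({"w": 0.0})
print(len(losses))  # 3
```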

To use PyTorch Ignite with Kedro, I developed a wrapper (declarative high-level API) of PyTorch Ignite and open-sourced as part of my PipelineX package:
https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/extras/ops/ignite/declaratives/declarative_trainer.py

Here is an example project to use the wrapper of PyTorch Ignite:
https://github.com/Minyus/pipelinex_pytorch

PyTorch Lightning provides a high-level API called "Trainer", so this could be used with Kedro.
https://pytorch-lightning.readthedocs.io/en/latest/trainer.html

@miyamonz
Author

miyamonz commented Sep 9, 2020

Thanks for your answers!
It helps to know that my approach is not wrong.

That said, I found a point that is difficult for a beginner facing a similar situation, and I want to let you know about it.

When I pass intermediate objects such as an optimizer between nodes without any config, they are stored in the default MemoryDataSet with deep-copy mode.
But that breaks things: the optimizer holds references to the pretrained model's parameters, and MemoryDataSet copies the optimizer object, severing those references.

There is no wrong method call or type error between nodes, because MemoryDataSet just copied the object.
So training runs with no error, but of course the accuracy doesn't improve.

When I first saw the name MemoryDataSet, I assumed it worked like assign mode, so it took me a long time to find my mistake.
Writing the config like this made my pipeline work:

optimizers:
  type: MemoryDataSet
  copy_mode: assign
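The difference between deep-copy mode and assign mode can be illustrated with plain Python objects (a sketch: the list stands in for model parameters, and the dict for an optimizer that keeps a reference to them).

```python
import copy

params = [1.0, 2.0]            # stand-in for model parameters
optimizer = {"params": params}  # optimizer keeps a reference to them

# Default MemoryDataSet behavior (deep copy): the reference is severed.
copied = copy.deepcopy(optimizer)
copied["params"][0] = 99.0
print(params[0])  # 1.0 -> the copy no longer points at the real parameters

# copy_mode: assign -> the very same object, reference preserved.
assigned = optimizer
assigned["params"][0] = 99.0
print(params[0])  # 99.0 -> updates reach the real parameters
```

This is exactly why training runs without errors in deep-copy mode but never improves: the optimizer silently updates a detached copy of the parameters.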

So I think this behavior should be documented.


I've realized since writing this that it is already documented here.
I don't use Spark, so I couldn't find it. 😂

@miyamonz
Author

miyamonz commented Sep 9, 2020

@Minyus

Kedro's pipeline does not support cyclic dependency, so it might be tough to use for a repetitive process such as multiple epochs of training neural network models.

Yes, that's important.
Fortunately, I originally assumed epoch loops have to live inside a node, so I didn't hit that problem.

I'll take a look at PipelineX. Thanks!

@921kiyo
Contributor

921kiyo commented Sep 9, 2020

I believe the original question has been solved. I'm closing this, but feel free to reopen it or open a new issue (or post a question on StackOverflow: https://stackoverflow.com/questions/tagged/kedro) if you have any follow-up questions :)

@921kiyo 921kiyo closed this as completed Sep 9, 2020