Samples - Added the data passing tutorial #2258

Ark-kun · 2019-09-27T21:20:39Z

This change is

Ark-kun · 2019-09-28T00:27:57Z

/retest

kevinbache · 2019-09-28T00:53:26Z

This is a nice start.

I think this example would benefit from more explanation. For example, when you introduce the big data processing method, it would make sense to say something like "The way Kubeflow Pipelines passes complex objects between components is by writing them to a file in the producer component and reading from the file consumer component. This means that the file's full contents don't ever have to be stored in memory or passed between components, only a pointer to the file need be passed."

Secondly, the semantics of OutputFile(str) vs OutputFile(int) could use an explanation. Does this mean that the OutputFile is referred to by a str/int? Does it mean that the contents of the file will be a str/int? Is OutputFile(str) long-hand for OutputTextFile? Why are InputFile and OutputFile separate classes rather than just a unified FileReference class? Incidentally, it might make sense to rename InputFile to InputFileReference or some such, since these objects don't fulfill Python's File-Like contract.

Also, the Binary imports aren't used.

Good work.

Ark-kun · 2019-09-28T01:19:19Z

Thank you for the review.
I'll incorporate some parts of your feedback into this tutorial and address other parts of the feedback.

I think this example would benefit from more explanation. For example, when you introduce the big data processing method

This makes sense. Thanks for the write-up - I'll include it in the tutorial.

Does this mean that the OutputFile is referred to by a str/int?
Does it mean that the contents of the file will be a str/int?

It's the latter. Normally, when you add parameter annotations, you want to specify the type of the data that your function or component wants to receive (e.g. arr: 'JsonArray'). But with bigger data, you also want to specify the data delivery method and the data flow direction (input or output): arr_file_path: InputPath('JsonArray') = "My component has an input of type "JsonArray", but it's big, so I want the system to write the data to a file and pass the path to that file into the function".

Is OutputFile(str) long-hand for OutputTextFile?

There is no OutputFile in the API. There are:

OutputPath - a local file path - a string. Useful when you want to open and close the file multiple times or pass the path to another program.
OutputTextFile - an opened text file - python's TextIOWrapper
OutputBinaryFile - an opened binary file - python's BinaryIO

Why are InputFile and OutputFile separate classes rather than just a unified FileReference class?

There are not InputFile or OutputFile classes...
There is a need to understand the direction of data flow. Which data does this function consume and which data does it produce. What are inputs and outputs.

Incidentally, it might make sense to rename InputFile to InputFileReference or some such, since these objects don't fulfill Python's File-Like contract.

These "annotation" classes are only used for annotating the function signature.
In some sense InputTextFile, OutputTextFile, InputBinaryFile and OutputBinaryFile do fulfill python's file-like contract since the objects the function will receive are python's native file objects produced by calling open.
So if you declare your function as def func(text_file: InputTextFile()), then at runtime text_file will be a proper file object.

The parameters annotated with InputPath and OutputPath will receive local file paths (strings).

Also, the Binary imports aren't used.

Probably they're too much for this tutorial (although they behave the same as the *TextFile annotations). I could demonstrate using them to save/load some data-frame objects, but that might be too much.

Do you think that InputTextFile and InputBinaryFile annotations are too confusing compared to the InputPath annotation?

kevinbache · 2019-09-28T20:49:51Z

So if you declare your function as def func(text_file: InputTextFile()), then at runtime text_file will be a proper file object.

It's going to be a different object type at runtime than the call signature declares? That's very confusing for the user. Will autocomplete work? What about Python load methods which require a file path rather than an open file object?

Ark-kun · 2019-09-30T20:24:42Z

It's going to be a different object type at runtime than the call signature declares? Will autocomplete work?

I can make another PR to make them inherit from the proper python classes.

What about Python load methods which require a file path rather than an open file object?

That's what the InputPath and OutputPath annotations are for. As you see, they are introduced first in the "Bigger data" tutorial.

SinaChavoshi

apart from the question on create_run_from_pipeline_func () it /lgtm

p.s. Good job! this looks like a very nice way of passing data between components.

SinaChavoshi · 2019-09-30T20:59:14Z

samples/tutorials/Data passing in python components.ipynb

+    "    print(text)\n",
+    "\n",
+    "def constant_to_consumer_pipeline():\n",
+    "    '''Pipeline that passes small constant string to to consumer'''\n",


nit: One-line docstring summary should end with "." , same with others in the doc.

samples/tutorials/Data passing in python components.ipynb

Ark-kun · 2019-09-30T22:08:40Z

Judging by customer feedback, it's important for us to add this tutorial.
The topic of data passing was a big pain point for every user we worked with. I think that giving users a straightforward solution and clear guidance is important.

Another recent comment here on GitHub from Nubank company which wants to use KFP: "Just took a look at your notebook, I wasn't aware of InputPath and OutputTextFile, they will be very handy."

Ark-kun · 2019-10-01T20:38:29Z

I've added big explanation sections to the tutorial.

SinaChavoshi · 2019-10-01T21:21:37Z

/lgtm

Ark-kun · 2019-10-01T22:44:12Z

/approve

k8s-ci-robot · 2019-10-01T22:44:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ark-kun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~samples/OWNERS~~ [Ark-kun]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gaoning777 · 2019-10-02T05:08:38Z

Why do we create the new directory? And, we probably need to add tests for it.

Ark-kun · 2019-10-04T01:38:45Z

we probably need to add tests for it.

I'll add it to the tests. There are two small issues preventing this, but I'll do that as soon as they're fixed.

gaoning777 · 2019-10-04T02:45:56Z

And, why not add it to the core/ directory?

karlschriek · 2019-10-30T09:19:34Z

Judging by customer feedback, it's important for us to add this tutorial.
The topic of data passing was a big pain point for every user we worked with. I think that giving users a straightforward solution and clear guidance is important.

@Ark-kun, I can wholeheartedly agree with this. I found the example above to be very useful, so thanks in the first place for putting it out there! It is however quite hard to figure out how to adapt it correctly, and it is entirely unclear what is happening in the background. The example could use a lot more explanation of what the underlying classes do. Alternatively, the classes themselves in the SDK could use some more detailed explanation.

The InputBinaryFile and OutputBinaryFile concepts are especially interesting to us, as we have many use cases where we don't constantly want to be writing to / reading from disk. How to effectively use this is currently not explained anywhere. With some small tests I have thusfar not been successful in getting a small prototype up and running.

Ark-kun · 2019-11-01T00:12:25Z

The InputBinaryFile and OutputBinaryFile concepts are especially interesting to us,

The InputBinaryFile is a very small syntactic sugar over InputPath. As the tutorial mentions, it just pre-opens the file for you, saving you 1 open call and reducing the indentation.

How to effectively use this is currently not explained anywhere. With some small tests I have thusfar not been successful in getting a small prototype up and running.

Can you please create open an issue? I've very responsive to them.

It is however quite hard to figure out how to adapt it correctly,

I'd like top hear more about your problems.

The example could use a lot more explanation of what the underlying classes do. Alternatively, the classes themselves in the SDK could use some more detailed explanation.

Which concrete classes do you mean? Did you mean InputPath and company? These classes are just markers/annotations. They convey additional information about particular function parameter.

I would really like to understand which aspects to explain better.

P.S. As python components are just built on top of generic container components, it's useful to see the generated component.yaml definition and code, so that you understand exactly what's going on.

Use func_to_container_op(my_func, output_component_file='my_1.component.yaml') to create a shareable/loadable component file.

* Fix incorrect function definition, issue kubeflow#2258 Signed-off-by: Eyal Cohen <eyalcohen@eyals-mbp.haifa.il.ibm.com> Signed-off-by: Eyal Cohen <eyalcohen@sig-9-145-160-25.de.ibm.com> * Fix lint - unused imported Signed-off-by: Eyal Cohen <eyalcohen@sig-9-145-160-25.de.ibm.com> Co-authored-by: Eyal Cohen <eyalcohen@eyals-mbp.haifa.il.ibm.com>

Samples - Added the data passing tutorial

c7f7593

Ark-kun added area/samples/notebooks area/samples labels Sep 27, 2019

Ark-kun requested review from gaoning777, numerology, SinaChavoshi and hongye-sun September 27, 2019 21:20

Ark-kun assigned gaoning777, numerology, SinaChavoshi and hongye-sun Sep 27, 2019

k8s-ci-robot added the size/L label Sep 27, 2019

k8s-ci-robot requested a review from animeshsingh September 27, 2019 21:20

SinaChavoshi reviewed Sep 30, 2019

View reviewed changes

Ark-kun added 3 commits October 1, 2019 12:08

Ensure the correct SDK version is installed

38f75c4

Corrected the output names

85ef568

Added extended explanation of the data passing process

6a70faa

k8s-ci-robot added the lgtm label Oct 1, 2019

k8s-ci-robot added the approved label Oct 1, 2019

k8s-ci-robot merged commit 6be5cd7 into kubeflow:master Oct 1, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Samples - Added the data passing tutorial #2258

Samples - Added the data passing tutorial #2258

Ark-kun commented Sep 27, 2019 •

edited by jlewi

Loading

Ark-kun commented Sep 28, 2019

kevinbache commented Sep 28, 2019

Ark-kun commented Sep 28, 2019 •

edited

Loading

kevinbache commented Sep 28, 2019 •

edited

Loading

Ark-kun commented Sep 30, 2019

SinaChavoshi left a comment •

edited

Loading

SinaChavoshi Sep 30, 2019

Ark-kun commented Sep 30, 2019

Ark-kun commented Oct 1, 2019

SinaChavoshi commented Oct 1, 2019

Ark-kun commented Oct 1, 2019

k8s-ci-robot commented Oct 1, 2019

gaoning777 commented Oct 2, 2019

Ark-kun commented Oct 4, 2019

gaoning777 commented Oct 4, 2019

karlschriek commented Oct 30, 2019 •

edited

Loading

Ark-kun commented Nov 1, 2019

Samples - Added the data passing tutorial #2258

Samples - Added the data passing tutorial #2258

Conversation

Ark-kun commented Sep 27, 2019 • edited by jlewi Loading

Ark-kun commented Sep 28, 2019

kevinbache commented Sep 28, 2019

Ark-kun commented Sep 28, 2019 • edited Loading

kevinbache commented Sep 28, 2019 • edited Loading

Ark-kun commented Sep 30, 2019

SinaChavoshi left a comment • edited Loading

Choose a reason for hiding this comment

SinaChavoshi Sep 30, 2019

Choose a reason for hiding this comment

Ark-kun commented Sep 30, 2019

Ark-kun commented Oct 1, 2019

SinaChavoshi commented Oct 1, 2019

Ark-kun commented Oct 1, 2019

k8s-ci-robot commented Oct 1, 2019

gaoning777 commented Oct 2, 2019

Ark-kun commented Oct 4, 2019

gaoning777 commented Oct 4, 2019

karlschriek commented Oct 30, 2019 • edited Loading

Ark-kun commented Nov 1, 2019

Ark-kun commented Sep 27, 2019 •

edited by jlewi

Loading

Ark-kun commented Sep 28, 2019 •

edited

Loading

kevinbache commented Sep 28, 2019 •

edited

Loading

SinaChavoshi left a comment •

edited

Loading

karlschriek commented Oct 30, 2019 •

edited

Loading