Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Samples - Added the data passing tutorial #2258

Conversation

Ark-kun
Copy link
Contributor

@Ark-kun Ark-kun commented Sep 27, 2019

This change is Reviewable

@Ark-kun
Copy link
Contributor Author

Ark-kun commented Sep 28, 2019

/retest

@kevinbache
Copy link
Contributor

This is a nice start.

I think this example would benefit from more explanation. For example, when you introduce the big data processing method, it would make sense to say something like "The way Kubeflow Pipelines passes complex objects between components is by writing them to a file in the producer component and reading from the file consumer component. This means that the file's full contents don't ever have to be stored in memory or passed between components, only a pointer to the file need be passed."

Secondly, the semantics of OutputFile(str) vs OutputFile(int) could use an explanation. Does this mean that the OutputFile is referred to by a str/int? Does it mean that the contents of the file will be a str/int? Is OutputFile(str) long-hand for OutputTextFile? Why are InputFile and OutputFile separate classes rather than just a unified FileReference class? Incidentally, it might make sense to rename InputFile to InputFileReference or some such, since these objects don't fulfill Python's File-Like contract.

Also, the Binary imports aren't used.

Good work.

@Ark-kun
Copy link
Contributor Author

Ark-kun commented Sep 28, 2019

Thank you for the review.
I'll incorporate some parts of your feedback into this tutorial and address other parts of the feedback.

I think this example would benefit from more explanation. For example, when you introduce the big data processing method

This makes sense. Thanks for the write-up - I'll include it in the tutorial.

Does this mean that the OutputFile is referred to by a str/int?
Does it mean that the contents of the file will be a str/int?

It's the latter. Normally, when you add parameter annotations, you want to specify the type of the data that your function or component wants to receive (e.g. arr: 'JsonArray'). But with bigger data, you also want to specify the data delivery method and the data flow direction (input or output): arr_file_path: InputPath('JsonArray') = "My component has an input of type "JsonArray", but it's big, so I want the system to write the data to a file and pass the path to that file into the function".

Is OutputFile(str) long-hand for OutputTextFile?

There is no OutputFile in the API. There are:

  • OutputPath - a local file path - a string. Useful when you want to open and close the file multiple times or pass the path to another program.
  • OutputTextFile - an opened text file - python's TextIOWrapper
  • OutputBinaryFile - an opened binary file - python's BinaryIO

Why are InputFile and OutputFile separate classes rather than just a unified FileReference class?

There are not InputFile or OutputFile classes...
There is a need to understand the direction of data flow. Which data does this function consume and which data does it produce. What are inputs and outputs.

Incidentally, it might make sense to rename InputFile to InputFileReference or some such, since these objects don't fulfill Python's File-Like contract.

These "annotation" classes are only used for annotating the function signature.
In some sense InputTextFile, OutputTextFile, InputBinaryFile and OutputBinaryFile do fulfill python's file-like contract since the objects the function will receive are python's native file objects produced by calling open.
So if you declare your function as def func(text_file: InputTextFile()), then at runtime text_file will be a proper file object.

The parameters annotated with InputPath and OutputPath will receive local file paths (strings).

Also, the Binary imports aren't used.

Probably they're too much for this tutorial (although they behave the same as the *TextFile annotations). I could demonstrate using them to save/load some data-frame objects, but that might be too much.

Do you think that InputTextFile and InputBinaryFile annotations are too confusing compared to the InputPath annotation?

@kevinbache
Copy link
Contributor

kevinbache commented Sep 28, 2019

So if you declare your function as def func(text_file: InputTextFile()), then at runtime text_file will be a proper file object.

It's going to be a different object type at runtime than the call signature declares? That's very confusing for the user. Will autocomplete work? What about Python load methods which require a file path rather than an open file object?

@Ark-kun
Copy link
Contributor Author

Ark-kun commented Sep 30, 2019

It's going to be a different object type at runtime than the call signature declares? Will autocomplete work?

I can make another PR to make them inherit from the proper python classes.

What about Python load methods which require a file path rather than an open file object?

That's what the InputPath and OutputPath annotations are for. As you see, they are introduced first in the "Bigger data" tutorial.

Copy link
Contributor

@SinaChavoshi SinaChavoshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apart from the question on create_run_from_pipeline_func () it /lgtm

p.s. Good job! this looks like a very nice way of passing data between components.

" print(text)\n",
"\n",
"def constant_to_consumer_pipeline():\n",
" '''Pipeline that passes small constant string to to consumer'''\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: One-line docstring summary should end with "." , same with others in the doc.

@Ark-kun
Copy link
Contributor Author

Ark-kun commented Sep 30, 2019

Judging by customer feedback, it's important for us to add this tutorial.
The topic of data passing was a big pain point for every user we worked with. I think that giving users a straightforward solution and clear guidance is important.

Another recent comment here on GitHub from Nubank company which wants to use KFP: "Just took a look at your notebook, I wasn't aware of InputPath and OutputTextFile, they will be very handy."

@Ark-kun
Copy link
Contributor Author

Ark-kun commented Oct 1, 2019

I've added big explanation sections to the tutorial.

@SinaChavoshi
Copy link
Contributor

/lgtm

@Ark-kun
Copy link
Contributor Author

Ark-kun commented Oct 1, 2019

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ark-kun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 6be5cd7 into kubeflow:master Oct 1, 2019
@gaoning777
Copy link
Contributor

Why do we create the new directory? And, we probably need to add tests for it.

@Ark-kun
Copy link
Contributor Author

Ark-kun commented Oct 4, 2019

we probably need to add tests for it.

I'll add it to the tests. There are two small issues preventing this, but I'll do that as soon as they're fixed.

@gaoning777
Copy link
Contributor

And, why not add it to the core/ directory?

@karlschriek
Copy link

karlschriek commented Oct 30, 2019

Judging by customer feedback, it's important for us to add this tutorial.
The topic of data passing was a big pain point for every user we worked with. I think that giving users a straightforward solution and clear guidance is important.

@Ark-kun, I can wholeheartedly agree with this. I found the example above to be very useful, so thanks in the first place for putting it out there! It is however quite hard to figure out how to adapt it correctly, and it is entirely unclear what is happening in the background. The example could use a lot more explanation of what the underlying classes do. Alternatively, the classes themselves in the SDK could use some more detailed explanation.

The InputBinaryFile and OutputBinaryFile concepts are especially interesting to us, as we have many use cases where we don't constantly want to be writing to / reading from disk. How to effectively use this is currently not explained anywhere. With some small tests I have thusfar not been successful in getting a small prototype up and running.

@Ark-kun
Copy link
Contributor Author

Ark-kun commented Nov 1, 2019

The InputBinaryFile and OutputBinaryFile concepts are especially interesting to us,

The InputBinaryFile is a very small syntactic sugar over InputPath. As the tutorial mentions, it just pre-opens the file for you, saving you 1 open call and reducing the indentation.

How to effectively use this is currently not explained anywhere. With some small tests I have thusfar not been successful in getting a small prototype up and running.

Can you please create open an issue? I've very responsive to them.

It is however quite hard to figure out how to adapt it correctly,

I'd like top hear more about your problems.

The example could use a lot more explanation of what the underlying classes do. Alternatively, the classes themselves in the SDK could use some more detailed explanation.

Which concrete classes do you mean? Did you mean InputPath and company? These classes are just markers/annotations. They convey additional information about particular function parameter.

I would really like to understand which aspects to explain better.

P.S. As python components are just built on top of generic container components, it's useful to see the generated component.yaml definition and code, so that you understand exactly what's going on.

Use func_to_container_op(my_func, output_component_file='my_1.component.yaml') to create a shareable/loadable component file.

magdalenakuhn17 pushed a commit to magdalenakuhn17/pipelines that referenced this pull request Oct 22, 2023
* Fix incorrect function definition, issue kubeflow#2258

Signed-off-by: Eyal Cohen <eyalcohen@eyals-mbp.haifa.il.ibm.com>
Signed-off-by: Eyal Cohen <eyalcohen@sig-9-145-160-25.de.ibm.com>

* Fix lint - unused imported

Signed-off-by: Eyal Cohen <eyalcohen@sig-9-145-160-25.de.ibm.com>

Co-authored-by: Eyal Cohen <eyalcohen@eyals-mbp.haifa.il.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants