SDK - Lightweight - Added support for file outputs #2221

Ark-kun · 2019-09-24T17:58:33Z

Lightweight components now allow a function to mark outputs that it wants to produce by writing data to files, rather than returning the data as in-memory objects.
This is useful when the data is expected to be large.

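Example 1 (writing a large amount of data to an output file at the provided path):

from kfp.components import func_to_container_op, OutputPath

@func_to_container_op
def write_big_data(big_file_path: OutputPath(str)):
    # Open in write mode; open() defaults to read-only
    with open(big_file_path, 'w') as big_file:
        for i in range(1000000):
            big_file.write('Hello world\n')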

Example 2 (writing a large amount of data to a provided output file stream):

from kfp.components import func_to_container_op, OutputTextFile

@func_to_container_op
def write_big_data(big_file: OutputTextFile(str)):
    for i in range(1000000):
        big_file.write('Hello world\n')

numerology · 2019-09-24T18:43:24Z

Good job! General question: is it possible to use OutputPath/OutputTextFile/OutputBinaryFile with a return statement and type hints? Or can we merge OutputPath and InputPath into one class, say ArtifactPath, and use it in both the component that produces the artifact and the component that consumes it? I have a vague feeling that this would be a more consistent experience. WDYT?

Ark-kun · 2019-09-24T21:21:11Z

> is it possible to use OutputPath/OutputTextFile/OutputBinaryFile with a return statement and type hints?

This is not possible, because the input/output paths must be known to the system at compile time.
The paths therefore have to be passed into the function by the system, rather than generated by the function and returned at runtime. So, somewhat confusingly, the component's output paths are inputs to the program/function.
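
To make the "output paths are inputs" point concrete, here is a minimal sketch in plain Python with no KFP imports. The `run_component` helper and the temporary directory are hypothetical stand-ins for what the system does; they are not KFP APIs.

```python
import os
import tempfile


def write_big_data(big_file_path):
    # The function never chooses the location; it only writes to the path it is given.
    with open(big_file_path, 'w') as big_file:
        for _ in range(3):
            big_file.write('Hello world\n')


def run_component(func, output_name):
    # Hypothetical stand-in for the system: the output location is decided
    # before the function runs and is passed in as an ordinary argument.
    output_path = os.path.join(tempfile.mkdtemp(), output_name)
    func(output_path)
    return output_path


path = run_component(write_big_data, 'big_file')
with open(path) as f:
    content = f.read()
```

Because the caller picks the path up front, it can wire that location to downstream consumers without running the function first.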

> Or, can we merge OutputPath and InputPath into one class, say ArtifactPath, and use it in both the component producing it and the component consuming it.

The function signature needs to tell the system which path parameters are inputs and which are outputs. InputPath and OutputPath are just dummy markers to convey that information to the system.
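
As a rough illustration of how marker annotations can carry this information, the sketch below inspects a function signature and sorts parameters into inputs and outputs. The two marker classes are local stand-ins modeled on the `OutputPath` definition in this PR's diff, and `classify_parameters` is a hypothetical helper, not the SDK's actual implementation.

```python
import inspect


class InputPath:
    # Local stand-in marker: the parameter is a path to read from.
    def __init__(self, type=None):
        self.type = type


class OutputPath:
    # Local stand-in marker: the parameter is a path to write to.
    def __init__(self, type=None):
        self.type = type


def classify_parameters(func):
    # Hypothetical helper: split parameter names by their marker annotation.
    inputs, outputs = [], []
    for name, param in inspect.signature(func).parameters.items():
        if isinstance(param.annotation, OutputPath):
            outputs.append(name)
        else:
            inputs.append(name)
    return inputs, outputs


def copy_data(source_path: InputPath(str), target_path: OutputPath(str)):
    with open(source_path) as src, open(target_path, 'w') as dst:
        dst.write(src.read())


inputs, outputs = classify_parameters(copy_data)
```

With a single merged marker class, the `isinstance` check above would have nothing to distinguish the two parameters.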

Ark-kun · 2019-09-24T21:27:51Z

An example of a function using both file inputs and outputs:

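from kfp.components import func_to_container_op, InputTextFile, OutputTextFile

@func_to_container_op
def write_big_data(input_file: InputTextFile(str), output_file: OutputTextFile(str)):
    # Iterating over the stream stops at end of file.
    # (readline() returns '' at EOF, never None, so an "is None" check would loop forever.)
    for line in input_file:
        output_file.write('Hello ' + line)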
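
Since the function body only touches file-like objects, it can be tried locally with in-memory streams before it is wrapped as a component. A small sketch, assuming the streams passed in behave like ordinary Python text files; `prefix_lines` is a hypothetical, undecorated copy of the function body.

```python
import io


def prefix_lines(input_file, output_file):
    # Iterating over a text stream ends at end of file.
    for line in input_file:
        output_file.write('Hello ' + line)


source = io.StringIO('Alice\nBob\n')
sink = io.StringIO()
prefix_lines(source, sink)
result = sink.getvalue()
```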

numerology · 2019-09-25T00:20:49Z

/lgtm

Ark-kun · 2019-09-25T00:33:48Z

/approve

k8s-ci-robot · 2019-09-25T00:33:53Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ark-kun

Needs approval from an approver in each of these files:

~~sdk/OWNERS~~ [Ark-kun]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kevinbache · 2019-09-25T00:54:28Z

In Example 2, I find it weird that we're using the passed location parameter as the object we're calling .write on.

kevinbache · 2019-09-25T00:55:34Z

sdk/python/kfp/components/_python_op.py

+class OutputPath:
+    '''When creating component from function, OutputPath should be used as function parameter annotation to tell the system that the function wants to output data by writing it into a file with the given path instead of returning the data from the function.'''
+    def __init__(self, type=None):
+        self.type = type


It's the input/output type.

Before:
def func(data: list):
After:
def func(data_file: InputFile(list)):

timofal · 2020-09-18T20:32:54Z

@Ark-kun Hello!
I spent a lot of time trying to figure out whether I can pass a big binary file between steps when I write the Docker image for each step (i.e. component) manually. Can I?

In other words: you are showing how to pass a big binary file using @func_to_container_op. I need the same, but I build manually written Docker images for each step. I also don't want to use storage like GCS to save intermediate data; I want k8s to pass the data on its own. Is that possible?

I'd appreciate any advice.

numerology · 2020-09-18T20:40:30Z

Hi @timofal

I think the canonical way to approach your use case is the following:

1. When using @func_to_container_op, specify the image you've just built as the base_image. Make sure it's stored in a registry that KFP has access to.
2. In the producer Python function, use OutputPath to annotate the location where the data will be written; in the consumer Python function, use InputPath to annotate the location the data is read from.
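
The two steps above can be sketched roughly as follows. This assumes the KFP v1 SDK layout (`kfp.components`); the fallback marker classes, the function names, and the image name in the trailing comment are hypothetical and only there so the sketch runs without the SDK installed.

```python
import os
import tempfile

try:
    from kfp.components import InputPath, OutputPath
except ImportError:
    # Stand-in markers so the sketch runs without the KFP SDK.
    class _PathMarker:
        def __init__(self, type=None):
            self.type = type

    class InputPath(_PathMarker):
        pass

    class OutputPath(_PathMarker):
        pass


def produce(data_path: OutputPath()):
    # Producer: write binary data to the path supplied by the system.
    with open(data_path, 'wb') as f:
        f.write(b'\x00' * 1024)


def consume(data_path: InputPath()) -> int:
    # Consumer: read from the path supplied by the system.
    return os.path.getsize(data_path)


# Local check with a temporary path standing in for the system-provided one.
local_path = os.path.join(tempfile.mkdtemp(), 'data')
produce(local_path)
size = consume(local_path)

# In a pipeline (assumption, not executed here), each function would be wrapped
# with the custom image, for example:
#   produce_op = func_to_container_op(produce, base_image='registry.example.com/my-image:latest')
```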

timofal · 2020-09-18T21:00:32Z

@numerology
Thank you for your response.
I believe using @func_to_container_op will be very inconvenient for me, because this step has exotic dependencies, a bunch of bash scripts and probably something else. All this stuff is already dockerized and works fine.

I checked the documentation. The Docker image for @func_to_container_op is meant to be a base image. I could probably implement a Python function that finds my entry point in the base image and starts the process, but that looks like a weird workaround. Is there a way to get big file sharing without using @func_to_container_op?

numerology · 2020-09-18T22:50:58Z

@timofal

Another way is perhaps to write a component yaml spec which refers to your docker image and use similar placeholders there.

See examples in our first party components:

pipelines/components/gcp/automl/create_dataset_for_tables/component.yaml

Line 141 in 52d5995

- {outputPath: dataset_path}
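
For a manually built image, a spec along these lines might look like the sketch below. It is written as a Python dict and serialized to JSON (which YAML parsers generally accept) to keep the example self-contained. The component name, image, and `/app/run.sh` entry point are hypothetical; only the `{outputPath: ...}` placeholder comes from the example referenced above.

```python
import json

component_spec = {
    'name': 'Produce big file',  # hypothetical component name
    'outputs': [{'name': 'dataset_path'}],
    'implementation': {
        'container': {
            'image': 'registry.example.com/my-step:latest',  # hypothetical image
            'command': [
                'sh', '-ec',
                # "$0" is the first argument after the script text, i.e. the
                # output path. Create its parent directory before writing,
                # since it may not exist in the container.
                'mkdir -p "$(dirname "$0")" && /app/run.sh > "$0"',
                {'outputPath': 'dataset_path'},
            ],
        },
    },
}

component_text = json.dumps(component_spec, indent=2)

# Assumption (not executed here): with the KFP SDK installed, something like
# kfp.components.load_component_from_text(component_text) would load this spec.
```

The container's own entry point stays untouched; the placeholder is replaced with a concrete file path when the step runs.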

Ark-kun added the area/sdk/components label Sep 24, 2019

Ark-kun requested review from kevinbache, gaoning777, numerology and hongye-sun September 24, 2019 17:58

Ark-kun assigned kevinbache, gaoning777, numerology and hongye-sun Sep 24, 2019

k8s-ci-robot added the size/L label Sep 24, 2019

Ark-kun mentioned this pull request Sep 24, 2019

Binary output artifacts getting encoded as string #2223

Closed

k8s-ci-robot added the lgtm label Sep 25, 2019

k8s-ci-robot added the approved label Sep 25, 2019

kevinbache reviewed Sep 25, 2019

k8s-ci-robot merged commit 3caba4e into kubeflow:master Sep 25, 2019
