pipeline step failed with exit status code 2: failed to save outputs #750
I am getting exactly the same error.
So my issue was that I had specified the wrong path.
Hey, I fixed the pipeline and it now runs run_preprocess.py >> run_train.py. The problem was in the run_preprocess.py file: you have to write the output file to a local file path inside your Docker container.
Kubeflow Pipelines manages its workflow by reading the files a step declares in file_outputs. A small pitfall to keep in mind is that inside the code of the second container (run_train.py in this case), the variable receives the contents of that output file rather than its path.
I wonder whether this way of chaining output to input is intended to pass parameters or small values to the next container, or whether Kubeflow wants you to pass the entire preprocessed dataset.
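That chaining contract can be illustrated without Kubeflow at all (all names and the URI below are hypothetical): step one writes a small value to a local path it declared as an output, the orchestrator collects that file, and step two receives the file's contents, not the path, as an argument. A minimal sketch:

```python
import os
import tempfile

def run_preprocess(output_path):
    # Step 1: write a small value (e.g. a storage URI) to a local file.
    # In kfp this path would be listed in the step's file_outputs.
    with open(output_path, "w") as f:
        f.write("gs://my-bucket/preprocessed")  # hypothetical URI

def run_train(preprocess_output):
    # Step 2: the argument is the first file's *contents*, not a path on disk.
    return "training on " + preprocess_output

def orchestrate():
    # What the orchestrator does conceptually: read the declared output
    # file after step 1 exits and hand its contents to step 2.
    out = os.path.join(tempfile.mkdtemp(), "output.txt")
    run_preprocess(out)
    with open(out) as f:
        value = f.read()
    return run_train(value)
```

This also shows why the pattern suits small values: the whole file content travels through the pipeline metadata, so a large preprocessed dataset is better left on shared storage with only its location passed along.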
@DurivetMatthias Could you share your workflow YAML file?
@vincent-pli the code is at https://github.com/DurivetMatthias/examples/blob/add_pipelines/financial_time_series/tensorflow_model/ml_pipeline.py. Make sure that when the preprocessing script ends, there is a file at each path specified in file_outputs.
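One cheap safeguard (the path below is hypothetical) is to check for the promised file at the very end of the preprocessing script, so the container fails with a clear message instead of the cryptic "failed to save outputs" later:

```python
import os

# Hypothetical path; use whatever you declared in file_outputs.
output_path = "/tmp/output1.txt"

with open(output_path, "w") as f:
    f.write("done")

# Fail fast inside the container if the promised output is missing.
if not os.path.isfile(output_path):
    raise FileNotFoundError("declared output not written: " + output_path)
```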
@DurivetMatthias thanks, it helped.
Thanks @DurivetMatthias for the fix, closing the issue.
Hey,
It seems like kfp is expecting /mainctrfs/data to be a file instead of a directory. There's a bunch of code that I can't see from here, but I would advise you to log the file structure of your Docker container just before the step fails; you should then notice that at least one file isn't where you promised it would be (try os.system("ls -l")).
Hope you find the problem :)
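Along the lines of that advice, a small helper (hypothetical; pass whatever root you mount) that dumps the container's file tree to the pod logs:

```python
import os

def log_tree(root):
    # Print every file under root with its full path, so the pod logs
    # show exactly what exists when the step is about to finish.
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            print(path)
            found.append(path)
    return found
```

Calling log_tree("/data") as the last line of the preprocessing script makes a missing output file obvious in the logs.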
On Tue, 12 Nov 2019 at 11:34, RochanMehrotra <notifications@github.com> wrote:
@DurivetMatthias @Svendegroote91 I'm getting much the same error:
This step is in Error state with this message: failed to save outputs: read /mainctrfs/data: is a directory
My pipeline is:
import kfp

def generate_data(output_uri, output_uri_in_file,
                  volume,
                  step_name='generate_data',
                  mount_output_to='/data'):
    return kfp.dsl.ContainerOp(
        name=step_name,
        image='rochanmehrotra/testing_kf:generate_data',
        arguments=[
            '--output1-path', output_uri,
            '--output1-path-file', output_uri_in_file,
        ],
        command=['python3', '/component/src/data_generator.py'],
        file_outputs={
            'output_file': output_uri,
            'output_uri_in_file': output_uri_in_file,
            'xtrain': "/data/x_train.npy",
            'ytrain': "/data/y_train.npy",
            'xtest': "/data/x_test.npy",
            'ytest': "/data/y_test.npy"
        },
        pvolumes={mount_output_to: volume}
    )

def train(output_uri, output_uri_in_file,
          volume,
          step_name='train',
          mount_output_to='/data'):
    return kfp.dsl.ContainerOp(
        name=step_name,
        image='rochanmehrotra/testing_kf:train',
        arguments=[
            '--model-path', output_uri,
            '--output1-path-file', output_uri_in_file,
        ],
        command=['python3', '/component/src/train.py'],
        file_outputs={
            'output_file': output_uri,
            'output_uri_in_file': output_uri_in_file,
        },
        pvolumes={mount_output_to: volume}
    )

def evaluate(output_uri, output_uri_in_file,
             volume,
             step_name='evaluate',
             mount_output_to='/data'):
    return kfp.dsl.ContainerOp(
        name=step_name,
        image='rochanmehrotra/testing_kf:evaluate',
        arguments=[
            '--model-path', output_uri,
            '--output1-path-file', output_uri_in_file,
        ],
        command=['python3', '/component/src/evaluate.py'],
        file_outputs={
            'output_file': output_uri,
            'output_uri_in_file': output_uri_in_file,
        },
        pvolumes={mount_output_to: volume}
    )

@kfp.dsl.pipeline(name='mlp pipeline', description='')
def mlp_pipeline(
        rok_url,
        pvc_size='4Gi'):
    vop = kfp.dsl.VolumeOp(
        name='create-volume',
        resource_name='mlp_pipeline',
        annotations={"rok/origin": rok_url},
        size=pvc_size
    )
    component_1 = generate_data(
        output_uri='/data',
        output_uri_in_file='/data',
        volume=vop.volume
    )
    component_2 = train(
        # output_uri='/data',
        # output_uri_in_file='/data/output1_path_file',
        output_uri=component_1.outputs['output_file'],
        output_uri_in_file=component_1.outputs['output_uri_in_file'],
        volume=vop.volume
    ).after(component_1)
    component_3 = evaluate(
        output_uri='/data/model.h5',
        output_uri_in_file='/data/output1_path_file',
        volume=vop.volume
    ).after(component_2)

if __name__ == '__main__':
    import kfp.compiler as compiler
    compiler.Compiler().compile(mlp_pipeline, 'mlp_pipeline.tar.gz')
The code for data_generator.py in the rochanmehrotra/testing_kf:generate_data image:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
import argparse
from pathlib import Path
import os

parser = argparse.ArgumentParser(description='My program description')
parser.add_argument('--output1-path', type=str, help='Path of the local file or GCS blob where the Output 1 data should be written.')
parser.add_argument('--output1-path-file', type=str, help='Path of the local file where the Output 1 URI data should be written.')
args = parser.parse_args()

dir_path = os.path.dirname(os.path.realpath(__file__))
print("dir_path=>", dir_path)

if not os.path.exists(args.output1_path):
    os.mkdir(args.output1_path)
if not os.path.exists(args.output1_path_file):
    os.mkdir(args.output1_path_file)

# Generate dummy data
x_train = np.random.random((1000, 20))
y_train = np.random.randint(2, size=(1000, 1))
x_test = np.random.random((100, 20))
y_test = np.random.randint(2, size=(100, 1))

np.save(args.output1_path + "/x_train.npy", x_train)
np.save(args.output1_path + "/y_train.npy", y_train)
np.save(args.output1_path + "/x_test.npy", x_test)
np.save(args.output1_path + "/y_test.npy", y_test)

paths = "{}/x_train.npy \n{}/y_train.npy \n{}/x_test.npy \n{}/y_test.npy".format(
    args.output1_path, args.output1_path, args.output1_path, args.output1_path)
file1 = open(args.output1_path + "/output1_path_file", "a")
file1.write(paths)
file1.close()

from os import walk
f = []
for (dirpath, dirnames, filenames) in walk(args.output1_path):
    print(dirpath)
    print(dirnames)
    f.extend(filenames)
print(f)
The output of generate_data is:
Using TensorFlow backend.
dir_path=> /component/src
/data
['lost+found']
/data/lost+found
[]
['x_test.npy', 'output1_path_file', 'x_train.npy', 'y_train.npy', 'y_test.npy']
but the step failed with this error:
This step is in Error state with this message: failed to save outputs: read /mainctrfs/data: is a directory
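For the pipeline above, the immediate cause is that file_outputs maps 'output_file' to '/data', which is the volume mount point, a directory. Every value in file_outputs has to name a file the container actually wrote. A sketch of what generate_data's mapping could look like instead (keys taken from the original; the directory-valued 'output_file' entry is dropped, and the remaining paths match the files the script writes):

```python
# Each entry must name a file, never a directory such as the /data mount
# point itself; a directory value is exactly what produces
# "failed to save outputs: read /mainctrfs/data: is a directory".
file_outputs = {
    'output_uri_in_file': '/data/output1_path_file',
    'xtrain': '/data/x_train.npy',
    'ytrain': '/data/y_train.npy',
    'xtest': '/data/x_test.npy',
    'ytest': '/data/y_test.npy',
}
```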
Hi,
I am trying to run a very basic Kubeflow pipeline with 2 components:
1/ preprocess
2/ train
However, when trying to run the pipeline, I get the message
This step is in Error state with this message: failed to save outputs: exit status 2
in the Pipeline UI. When I check the pod logs I get the following error; I attached the successful logs before the error.
My actual pipeline script is as follows:
The preprocess container is actually executed, as the files were stored on storage, but it looks like something is going wrong between the container communication and the orchestration.
As the error message is quite cryptic, can anyone help tell me where to look to fix this issue?
FYI: this is my run_preprocess.py: