
Tfx samples without gcs. #19

Closed

nidhidamodaran opened this issue Mar 22, 2019 · 20 comments

Comments

@nidhidamodaran

Is tfx using kubeflow pipeline strictly tied with gcs access?

@neuromage

Hi @nidhidamodaran,

No, it's not tied to GCS access at all. Kubeflow Pipelines itself is designed to be run on-premise as well as on GKE, so you shouldn't feel the need to use GCP at all. The example TFX chicago taxi pipeline on kubeflow does use GCP services, including GCS for storage, Dataflow for Beam jobs and Cloud ML Engine for training at scale. You can however easily remove these dependencies and run the pipeline on-premise, though scalability will be a problem without a distributed runner for Beam. Hope that helps. Let me know if you have any more questions.
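
For a rough idea, here is a minimal sketch of the kinds of local substitutions involved (all paths and flag values below are illustrative placeholders, not the sample's actual settings):

# Local stand-ins for the GCP-specific settings in the example pipeline.
_pipeline_root = '/var/tmp/tfx/pipelines/chicago_taxi_simple'  # local dir instead of a gs:// bucket
_data_root = '/var/tmp/tfx/data/simple'                        # local copy of the taxi CSV data

# Instead of Dataflow, let Beam fall back to the single-machine DirectRunner.
_beam_pipeline_args = ['--runner=DirectRunner']

# Instead of Cloud ML Engine, simply omit the CMLE-specific training arguments
# so the Trainer component runs locally inside the pipeline container.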

@nidhidamodaran
Author

Hi @neuromage thanks for the reply.

I was trying out the Chicago taxi sample to get hands-on with TFX.
Below is the sample I tried:

# Imports as used in the TFX 0.12-era Chicago taxi example (module paths may
# differ in other TFX versions). _data_root, _pipeline_root and
# logger_overrides are defined earlier in the script.
from tfx.components import CsvExampleGen, ExampleValidator, SchemaGen, StatisticsGen
from tfx.orchestration.kubeflow.runner import KubeflowRunner
from tfx.orchestration.pipeline import PipelineDecorator
from tfx.utils.dsl_utils import csv_input


@PipelineDecorator(
    pipeline_name='chicago_taxi_simple',
    log_root='/var/tmp/tfx/logs',
    enable_cache=True,
    additional_pipeline_args={'logger_args': logger_overrides},
    pipeline_root=_pipeline_root)
def _create_pipeline():
  # Brings the CSV data into the pipeline.
  examples = csv_input(_data_root)
  example_gen = CsvExampleGen(input_base=examples)

  # Computes statistics over the data.
  statistics_gen = StatisticsGen(input_data=example_gen.outputs.examples)

  # Infers a schema from the statistics.
  infer_schema = SchemaGen(stats=statistics_gen.outputs.output)

  # Validates examples against the inferred schema.
  validate_stats = ExampleValidator(
      stats=statistics_gen.outputs.output, schema=infer_schema.outputs.output)

  return [example_gen, statistics_gen, infer_schema, validate_stats]


pipeline = KubeflowRunner().run(_create_pipeline())

When I run the pipeline, pod creation fails with the error:

Unable to mount volumes for pod "chicago-taxi-simple-wwxmc-4030983742_kubeflow(438e1b67-4c73-11e9-a7e6-0273ce6a77d4)": timeout expired waiting for volumes to attach or mount for pod "kubeflow"/"chicago-taxi-simple-wwxmc-4030983742". list of unmounted volumes=[gcp-credentials]. list of unattached volumes=[podmetadata docker-lib docker-sock gcp-credentials pipeline-runner-token-gk4d7]

Could you help me understand what I am doing wrong here?

@neuromage

Are you running this in a Kubeflow cluster in GCP? It looks like that's not the case. The error indicates that it was unable to mount the GCP credentials.

I think this brings up an important issue though. We should allow non-GCP usage of the components, so mounting of GCP credentials should be user-configurable. I'll work on fixing this. In the meantime, if you're truly running on-prem, you can remove the apply call from the following line of code so that GCP authentication is not used:

).apply(gcp.use_gcp_secret('user-gcp-sa')) # Adds GCP authentication.
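
In other words (an illustrative sketch of what a step looks like with and without that call; the step name, image and arguments are placeholders, not the actual op emitted by the runner):

from kfp import dsl
from kfp import gcp

# With GCP authentication: the step mounts the 'user-gcp-sa' secret.
step = dsl.ContainerOp(
    name='csvexamplegen',                        # placeholder step name
    image='tensorflow/tfx:latest',               # placeholder image
    arguments=['--component', 'CsvExampleGen'],  # placeholder arguments
).apply(gcp.use_gcp_secret('user-gcp-sa'))

# On-prem: drop the .apply(...) so no GCP credential volume is mounted.
step = dsl.ContainerOp(
    name='csvexamplegen',
    image='tensorflow/tfx:latest',
    arguments=['--component', 'CsvExampleGen'],
)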

@nidhidamodaran
Author

Yes @neuromage, I will try that. Thanks.

@nidhidamodaran
Author

Also, is there any option of using a custom image for different pipeline stages in TFX?

@neuromage

Also, is there any option of using a custom image for different pipeline stages in TFX?

Right now, this isn't possible without some work. You'd probably need to write the pipeline using the Kubeflow Pipelines SDK instead, which would let you insert custom images/steps into your pipeline. However, this isn't straightforward, as you need to figure out how to pass around the metadata artifacts and use them in your custom step. I am planning to enable this use-case soon though, and will document it as a sample within the Kubeflow Pipelines repo when it's done.
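
As a rough illustration of the KFP SDK route (a hand-written sketch, not a TFX-generated pipeline; the images and arguments are placeholders):

import kfp
from kfp import dsl

@dsl.pipeline(name='custom-image-example', description='One custom image per step.')
def custom_image_pipeline():
    # Each step is a ContainerOp, so each step can use its own image.
    preprocess = dsl.ContainerOp(
        name='preprocess',
        image='gcr.io/my-project/my-preprocess:latest',  # placeholder image
        arguments=['--output', '/tmp/preprocessed'],     # placeholder arguments
    )
    train = dsl.ContainerOp(
        name='train',
        image='gcr.io/my-project/my-trainer:latest',     # placeholder image
        arguments=['--input', '/tmp/preprocessed'],      # placeholder arguments
    )
    # Run the training step only after preprocessing finishes.
    train.after(preprocess)

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(custom_image_pipeline, 'custom_image_example.tar.gz')

Passing the TFX metadata artifacts into and out of such custom steps is the part that isn't covered yet, as noted above.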

/cc @krazyhaas
/cc @zhitaoli
/cc @ruoyu90

@MattMorgis
Contributor

You'd probably need to write the pipeline using the Kubeflow Pipelines SDK instead. … However, this isn't straightforward, as you need to figure out how to pass around the metadata artifacts.

I'd love to add that to the docs or maybe even the README.md on the tfx examples in the Kubeflow Pipelines repo.

We had found this PR and were wondering whether it was already possible to pass these artifacts around, whether that was encouraged, or whether a pipeline could only be constructed of TFX components in order to use the metadata store. Your comment seems to answer all of these questions!

Thanks for all of these detailed responses @neuromage, they have been super helpful in connecting some dots while getting started.

@neuromage

Thanks @MattMorgis. Yeah, right now, we record those artifacts and TFX components know how to pass them around in a Kubeflow pipeline, but we haven't made this easily accessible by custom components just yet. I am planning on enabling this over the next few weeks. I'll update this thread with more info then.

@MattMorgis
Contributor

We made some progress running this on AWS with Kubeflow, but we just hit one snag that is going to take a bit to overcome:

ValueError: Unable to get the Filesystem for path s3://<bucket>/data.csv

It's interesting because it is successfully connecting to S3 to read the filename, data.csv. We simply specify the bucket.

However, I think the error that is raised is related to Apache Beam's Python SDK not having an S3 FileSystem: https://issues.apache.org/jira/browse/BEAM-2572

@krazyhaas

However, I think the error that is raised is related to Apache Beam's Python SDK not having an S3 FileSystem: https://issues.apache.org/jira/browse/BEAM-2572

That's correct. Until Beam's Python SDK supports S3, we can't run most of the TFX libraries on S3. We have a similar challenge with Azure Blob Storage.

@MattMorgis
Contributor

I've been working on it. I'm about 50% complete and working with the Beam project/team to get it merged.

According to the ticket, there is a Google Summer of Code student who may implement the Azure Blob Storage file system as well.

@zhitaoli
Contributor

zhitaoli commented Apr 17, 2019 via email

@MattMorgis
Contributor

That is a very good point @zhitaoli. We realized TensorFlow itself has S3 support, and it was able to find the CSV file in the bucket we were pointing to; however, we then ran into the Beam unsupported S3 file system error.

I didn't realize Azure Blob Storage isn't supported in TensorFlow itself either, in addition to Beam. I'll mention that in the ticket.

ruoyu90 pushed a commit to ruoyu90/tfx that referenced this issue Aug 28, 2019
@karlschriek

karlschriek commented Nov 14, 2019

Looks like Beam support for S3 is close to being implemented (see https://issues.apache.org/jira/browse/BEAM-2572 and apache/beam#9955).

I would just like to second what has been discussed here. There is a pretty large user community who are interested in TFX and/or Kubeflow but are currently struggling to get into those frameworks due to a lack of non-GCP examples (and sometimes core functionality).

A TFX Chicago Taxi example on Kubeflow for AWS/Azure/on-prem would be a great starting point for those of us who are currently not on GCP!

@ucdmkt
Contributor

ucdmkt commented Nov 14, 2019

taxi_pipeline_kubeflow_local.py does not depend on GCP or GKE; it only depends on a Kubeflow Pipelines deployment (backend) on any Kubernetes cluster.

@zhitaoli
Contributor

zhitaoli commented Jan 27, 2020

To capture the discussion: TFX examples can already use HDFS and GCS (although we don't have an example for the former), and once Beam 2.18 is picked up there is also S3 support.

Azure blob storage support is tracked in Beam side.

We will file a separate feature request (#1185) for using separate images in each stage and discuss it there.

We will close this issue as won't fix. Please let us know if you think otherwise.

@krazyhaas

@zhitaoli I thought the Beam fix won't be integrated until 2.19, per the last update from Pablo. Can you confirm it will be available in 2.18?

@gowthamkpr

@zhitaoli Can you please take a look at the above comment? Thanks!

@hanneshapke

@zhitaoli @gowthamkpr With Apache Beam 2.19, we still get

ValueError: Unable to get the Filesystem for path s3://<bucket>/test-data/*

Are there some tfx dependencies? We use tfx 0.21.

@MattMorgis
Contributor

I think TFX installs Apache Beam with the gcp extras, and to use S3 you'll need to install with the aws ones. I also think the only way to do that right now is to rebuild from source.
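
A quick way to check whether the S3 filesystem is actually registered in your environment (assuming Beam >= 2.19 installed with its aws extra, e.g. apache-beam[aws]; the bucket path below is a placeholder):

from apache_beam.io.filesystems import FileSystems

# Raises a ValueError like the one above if no filesystem is registered for
# the s3:// scheme (i.e. the aws extra and its dependencies are missing).
fs = FileSystems.get_filesystem('s3://my-bucket/data.csv')
print(type(fs))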
