Update component.yaml for dataflow and cmle with recent changes. (#987)
* Update component.yaml for dataflow and cmle with recent changes.

* Add type information in the GCP component yaml files

* Fix typo in component yaml
hongye-sun authored and k8s-ci-robot committed Mar 20, 2019
1 parent e09a8ff commit 7e45dda
Showing 14 changed files with 370 additions and 76 deletions.
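
The headline change across these files is the new `type` field on every input and output (`String`, `GCPProjectID`, `GCPRegion`, `GCSPath`, `Dict`, `List`, `Integer`, `Bool`). The KFP SDK can use these annotations to type-check connections between components at compile time. Below is a minimal sketch of how that surfaces in the DSL, assuming a KFP release contemporary with this commit; the `master` ref, pipeline name, and project ID are illustrative (in practice, pin the raw URL to this commit's full SHA).

```python
import kfp
from kfp import components, dsl

# Load the BigQuery component; its inputs now carry type metadata.
bigquery_query_op = components.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/'
    'components/gcp/bigquery/query/component.yaml')

@dsl.pipeline(name='typed-components-example')
def pipeline(project_id='my-project'):
    # project_id feeds a GCPProjectID-typed input; declared types are
    # verified when the compiler wires components together.
    bigquery_query_op(query='SELECT 1', project_id=project_id)

# type_check=True (the default) enforces the declared input/output types.
kfp.compiler.Compiler().compile(pipeline, 'pipeline.tar.gz', type_check=True)
```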
39 changes: 31 additions & 8 deletions components/gcp/bigquery/query/component.yaml
@@ -16,15 +16,38 @@ name: Bigquery - Query
description: |
Submit a query to the BigQuery service and write outputs to a GCS blob.
inputs:
- {name: query, description: 'The query used by Bigquery service to fetch the results.'}
- {name: project_id, description: 'The project to execute the query job.' }
- {name: dataset_id, description: 'The ID of the persistent dataset to keep the results of the query.', default: '' }
- {name: table_id, description: 'The ID of the table to keep the results of the query. If absent, the operation will generate a random id for the table.', default: '' }
- {name: output_gcs_path, description: 'The GCS blob path to dump the query results to.', default: '' }
- {name: dataset_location, description: 'The location to create the dataset. Defaults to `US`.', default: 'US' }
- {name: job_config, description: 'The full config spec for the query job.', default: '' }
- name: query
description: 'The query used by Bigquery service to fetch the results.'
type: String
- name: project_id
description: 'The project to execute the query job.'
type: GCPProjectID
- name: dataset_id
description: 'The ID of the persistent dataset to keep the results of the query.'
default: ''
type: String
- name: table_id
description: >-
The ID of the table to keep the results of the query. If absent, the operation
will generate a random id for the table.
default: ''
type: String
- name: output_gcs_path
description: 'The GCS blob path to dump the query results to.'
default: ''
type: GCSPath
- name: dataset_location
description: 'The location to create the dataset. Defaults to `US`.'
default: 'US'
type: String
- name: job_config
description: 'The full config spec for the query job.'
default: ''
type: Dict
outputs:
- {name: output_gcs_path, description: 'The GCS blob path to dump the query results to.'}
- name: output_gcs_path
description: 'The GCS blob path to dump the query results to.'
type: GCSPath
implementation:
container:
image: gcr.io/ml-pipeline/ml-pipeline-gcp:latest
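
A sketch of invoking the updated BigQuery component from a pipeline, under the same assumptions as above; the query, dataset, and bucket are placeholders, and `use_gcp_secret('user-gcp-sa')` assumes that secret is installed on the cluster:

```python
from kfp import components
from kfp.gcp import use_gcp_secret

bigquery_query_op = components.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/'
    'components/gcp/bigquery/query/component.yaml')

# Inside a @dsl.pipeline function:
query_task = bigquery_query_op(
    query='SELECT * FROM `bigquery-public-data.samples.shakespeare` LIMIT 10',
    project_id='my-project',
    dataset_id='my_dataset',                         # persistent dataset for results
    output_gcs_path='gs://my-bucket/bq/results.csv',
    dataset_location='US',
).apply(use_gcp_secret('user-gcp-sa'))
```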
40 changes: 30 additions & 10 deletions components/gcp/dataflow/launch_python/component.yaml
@@ -16,25 +16,45 @@ name: Launch Python
description: |
Launch a self-executing Apache Beam Python file.
inputs:
- {name: python_file_path, description: 'The gcs or local path to the python file to run.'}
- {name: project_id, description: 'The ID of the parent project.' }
- {name: requirements_file_path, description: 'Optional, the gcs or local path to the pip requirements file', default: '' }
- {name: location, description: 'The regional endpoint to which to direct the request.', default: '' }
- {name: job_name_prefix, description: 'Optional. The prefix of the genrated job name. If not provided, the method will generated a random name.', default: '' }
- {name: args, description: 'The list of args to pass to the python file.', default: '[]' }
- {name: wait_interval, default: '30', description: 'Optional wait interval between calls to get job status. Defaults to 30.' }
- name: python_file_path
description: 'The GCS or local path to the Python file to run.'
type: String
- name: project_id
description: 'The ID of the parent project.'
type: GCPProjectID
- name: staging_dir
description: >-
Optional. The GCS directory for keeping staging files.
A random subdirectory will be created under the directory to keep job info
for resuming the job in case of failure and it will be passed as
`staging_location` and `temp_location` command line args of the beam code.
default: ''
type: GCSPath
- name: requirements_file_path
description: 'Optional. The GCS or local path to the pip requirements file.'
default: ''
type: GCSPath
- name: args
description: 'The list of args to pass to the python file.'
default: '[]'
type: List
- name: wait_interval
default: '30'
description: 'Optional wait interval between calls to get job status. Defaults to 30.'
type: Integer
outputs:
- {name: job_id, description: 'The id of the created dataflow job.'}
- name: job_id
description: 'The ID of the created Dataflow job.'
type: String
implementation:
container:
image: gcr.io/ml-pipeline/ml-pipeline-gcp:latest
args: [
kfp_component.google.dataflow, launch_python,
--python_file_path, {inputValue: python_file_path},
--project_id, {inputValue: project_id},
--staging_dir, {inputValue: staging_dir},
--requirements_file_path, {inputValue: requirements_file_path},
--location, {inputValue: location},
--job_name_prefix, {inputValue: job_name_prefix},
--args, {inputValue: args},
--wait_interval, {inputValue: wait_interval}
]
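
A sketch of the Launch Python component with the `staging_dir` input this commit introduces; because job info is kept under a staging subdirectory, a retried launch can resume the same Dataflow job. Paths and args are placeholders:

```python
from kfp import components
from kfp.gcp import use_gcp_secret

dataflow_python_op = components.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/'
    'components/gcp/dataflow/launch_python/component.yaml')

# Inside a @dsl.pipeline function:
launch_task = dataflow_python_op(
    python_file_path='gs://my-bucket/pipelines/wordcount.py',
    project_id='my-project',
    # New in this commit: also forwarded to the Beam code as its
    # --staging_location and --temp_location arguments.
    staging_dir='gs://my-bucket/dataflow-staging',
    args='["--output", "gs://my-bucket/wordcount/out"]',  # JSON-encoded List
    wait_interval='30',
).apply(use_gcp_secret('user-gcp-sa'))
```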
49 changes: 40 additions & 9 deletions components/gcp/dataflow/launch_template/component.yaml
@@ -16,15 +16,46 @@ name: Launch Dataflow Template
description: |
Launches a Dataflow job from a template.
inputs:
- {name: project_id, description: 'Required. The ID of the Cloud Platform project that the job belongs to.'}
- {name: gcs_path, description: 'Required. A Cloud Storage path to the template from which to create the job. Must be valid Cloud Storage URL, beginning with `gs://`.' }
- {name: launch_parameters, description: 'Parameters to provide to the template being launched. Schema defined in https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters. `jobName` will be replaced by generated name.' }
- {name: location, description: 'The regional endpoint to which to direct the request.', default: '' }
- {name: job_name_prefix, description: 'Optional. The prefix of the genrated job name. If not provided, the method will generated a random name.', default: '' }
- {name: validate_only, description: 'If true, the request is validated but not actually executed. Defaults to false.', default: 'False' }
- {name: wait_interval, description: 'Optional wait interval between calls to get job status. Defaults to 30.', default: '30'}
- name: project_id
description: 'Required. The ID of the Cloud Platform project that the job belongs to.'
type: GCPProjectID
- name: gcs_path
description: >-
Required. A Cloud Storage path to the template from
which to create the job. Must be a valid Cloud Storage URL, beginning with `gs://`.
type: GCSPath
- name: launch_parameters
description: >-
Parameters to provide to the template being launched. Schema defined in
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters.
`jobName` will be replaced by the generated name.
type: Dict
default: '{}'
- name: location
description: 'The regional endpoint to which to direct the request.'
default: ''
type: GCPRegion
- name: validate_only
description: >-
If true, the request is validated but not actually executed. Defaults to false.
default: 'False'
type: Bool
- name: staging_dir
description: >-
Optional. The GCS directory for keeping staging files.
A random subdirectory will be created under the directory to keep job info
for resuming the job in case of failure.
default: ''
type: GCSPath
- name: wait_interval
description: >-
Optional wait interval between calls to get job status. Defaults to 30.
default: '30'
type: Integer
outputs:
- {name: job_id, description: 'The ID of the created dataflow job.'}
- name: job_id
description: 'The ID of the created Dataflow job.'
type: String
implementation:
container:
image: gcr.io/ml-pipeline/ml-pipeline-gcp:latest
@@ -34,8 +65,8 @@ implementation:
--gcs_path, {inputValue: gcs_path},
--launch_parameters, {inputValue: launch_parameters},
--location, {inputValue: location},
--job_name_prefix, {inputValue: job_name_prefix},
--validate_only, {inputValue: validate_only},
--staging_dir, {inputValue: staging_dir},
--wait_interval, {inputValue: wait_interval},
]
env:
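
Likewise for the template launcher, where `staging_dir` replaces the removed `job_name_prefix`; the Google-provided `Word_Count` template and sample input are real public paths, while project and bucket names are placeholders:

```python
import json
from kfp import components

dataflow_template_op = components.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/'
    'components/gcp/dataflow/launch_template/component.yaml')

# Inside a @dsl.pipeline function:
template_task = dataflow_template_op(
    project_id='my-project',
    gcs_path='gs://dataflow-templates/latest/Word_Count',
    # Dict-typed input, passed as a JSON string; jobName is generated.
    launch_parameters=json.dumps({
        'parameters': {
            'inputFile': 'gs://dataflow-samples/shakespeare/kinglear.txt',
            'output': 'gs://my-bucket/wordcount/out',
        }
    }),
    location='us-central1',
    validate_only='False',
    staging_dir='gs://my-bucket/dataflow-staging',
    wait_interval='30',
)
```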
13 changes: 12 additions & 1 deletion components/gcp/dataproc/create_cluster/component.yaml
@@ -19,39 +19,50 @@ inputs:
- name: project_id
description: >-
Required. The ID of the Google Cloud Platform project that the cluster belongs to.
type: GCPProjectID
- name: region
description: 'Required. The Cloud Dataproc region in which to handle the request.'
type: GCPRegion
- name: name
description: >-
Optional. The cluster name. Cluster names within a project must be unique. Names of
deleted clusters can be reused.
default: ''
type: String
- name: name_prefix
description: 'Optional. The prefix of the cluster name.'
default: ''
type: String
- name: initialization_actions
description: >-
Optional. List of GCS URIs of executables to execute on each node after config
is completed. By default, executables are run on master and all worker nodes.
default: ''
type: List
- name: config_bucket
description: >-
Optional. A Google Cloud Storage bucket used to stage job dependencies, config
files, and job driver console output.
default: ''
type: GCSPath
- name: image_version
description: 'Optional. The version of software inside the cluster.'
default: ''
type: String
- name: cluster
description: >-
Optional. The full cluster config. See
[full details](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#Cluster)
default: ''
type: Dict
- name: wait_interval
default: '30'
description: 'Optional. The wait seconds between polling the operation. Defaults to 30.'
type: Integer
outputs:
- {name: cluster_name, description: 'The cluster name of the created cluster.'}
- name: cluster_name
description: 'The cluster name of the created cluster.'
type: String
implementation:
container:
image: gcr.io/ml-pipeline/ml-pipeline-gcp:latest
4 changes: 4 additions & 0 deletions components/gcp/dataproc/delete_cluster/component.yaml
@@ -19,14 +19,18 @@ inputs:
- name: project_id
description: >-
Required. The ID of the Google Cloud Platform project that the cluster belongs to.
type: GCPProjectID
- name: region
description: >-
Required. The Cloud Dataproc region in which to handle the request.
type: GCPRegion
- name: name
description: 'Required. The cluster name to delete.'
type: String
- name: wait_interval
default: '30'
description: 'Optional. The wait seconds between polling the operation. Defaults to 30.'
type: Integer
implementation:
container:
image: gcr.io/ml-pipeline/ml-pipeline-gcp:latest
13 changes: 12 additions & 1 deletion components/gcp/dataproc/submit_hadoop_job/component.yaml
@@ -21,47 +21,58 @@ inputs:
description: >-
Required. The ID of the Google Cloud Platform project that the cluster
belongs to.
type: GCPProjectID
- name: region
description: >-
Required. The Cloud Dataproc region in which to handle the request.
type: GCPRegion
- name: cluster_name
description: 'Required. The cluster to run the job.'
type: String
- name: main_jar_file_uri
default: ''
description: >-
The HCFS URI of the jar file containing the main class. Examples:
`gs://foo-bucket/analytics-binaries/extract-useful-metrics-mr.jar`
`hdfs:/tmp/test-samples/custom-wordcount.jar`
`file:///home/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar`
type: GCSPath
- name: main_class
default: ''
description: >-
The name of the driver's main class. The jar file
containing the class must be in the default CLASSPATH or specified
in `jarFileUris`.
type: String
- name: args
default: ''
description: >-
Optional. The arguments to pass to the driver. Do not include
arguments, such as -libjars or -Dfoo=bar, that can be set as job properties,
since a collision may occur that causes an incorrect job submission.
type: List
- name: hadoop_job
default: ''
description: >-
Optional. The full payload of a
[hadoop job](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob).
type: Dict
- name: job
default: ''
description: >-
Optional. The full payload of a
[Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs).
type: Dict
- name: wait_interval
default: '30'
description: >-
Optional. The wait seconds between polling the operation.
Defaults to 30.
type: Integer
outputs:
- {name: job_id, description: 'The ID of the created job.'}
- name: job_id
description: 'The ID of the created job.'
type: String
implementation:
container:
image: gcr.io/ml-pipeline/ml-pipeline-gcp:latest
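
A sketch of submitting a Hadoop job, reusing the examples-jar URI from the component's own description; cluster and bucket names are placeholders:

```python
from kfp import components

submit_hadoop_job_op = components.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/'
    'components/gcp/dataproc/submit_hadoop_job/component.yaml')

# Inside a @dsl.pipeline function:
hadoop_task = submit_hadoop_job_op(
    project_id='my-project',
    region='us-central1',
    cluster_name='my-cluster',
    main_jar_file_uri='file:///home/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar',
    # args is a JSON-encoded List, matching the new `type: List` annotation.
    args='["wordcount", "gs://my-bucket/input/", "gs://my-bucket/wordcount-out/"]',
)
```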
15 changes: 13 additions & 2 deletions components/gcp/dataproc/submit_hive_job/component.yaml
@@ -20,43 +20,54 @@ inputs:
description: >-
Required. The ID of the Google Cloud Platform project that the cluster
belongs to.
type: GCPProjectID
- name: region
description: >-
Required. The Cloud Dataproc region in which to handle the request.
type: GCPRegion
- name: cluster_name
description: 'Required. The cluster to run the job.'
type: String
- name: queries
default: ''
description: >-
Required. The queries to execute. You do not need to
terminate a query with a semicolon. Multiple queries can be specified
in one string by separating each with a semicolon.
type: List
- name: query_file_uri
default: ''
description: >-
The HCFS URI of the script that contains Hive queries.
type: GCSPath
- name: script_variables
default: ''
description: >-
Optional. Mapping of query variable names to
values (equivalent to the Hive command: SET name="value";).
type: Dict
- name: hive_job
default: ''
description: >-
Optional. The full payload of a
[HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob)
[HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob).
type: Dict
- name: job
default: ''
description: >-
Optional. The full payload of a
[Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs).
type: Dict
- name: wait_interval
default: '30'
description: >-
Optional. The wait seconds between polling the operation.
Defaults to 30.
type: Integer
outputs:
- {name: job_id, description: 'The ID of the created job.'}
- name: job_id
description: 'The ID of the created job.'
type: String
implementation:
container:
image: gcr.io/ml-pipeline/ml-pipeline-gcp:latest
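
Finally, the Dataproc components compose into a create → submit → delete lifecycle. A sketch under the same assumptions as the earlier examples (placeholder project, region, and query); the create step's `cluster_name` output feeds the downstream String-typed inputs:

```python
import kfp
from kfp import components, dsl

_BASE = ('https://raw.githubusercontent.com/kubeflow/pipelines/master/'
         'components/gcp/dataproc/')
create_cluster_op = components.load_component_from_url(_BASE + 'create_cluster/component.yaml')
submit_hive_job_op = components.load_component_from_url(_BASE + 'submit_hive_job/component.yaml')
delete_cluster_op = components.load_component_from_url(_BASE + 'delete_cluster/component.yaml')

@dsl.pipeline(name='dataproc-hive-example')
def hive_pipeline(project_id='my-project', region='us-central1'):
    create = create_cluster_op(
        project_id=project_id, region=region, name_prefix='hive-demo')
    hive = submit_hive_job_op(
        project_id=project_id, region=region,
        cluster_name=create.output,        # cluster_name output -> String input
        queries='["SHOW DATABASES;"]')     # JSON-encoded List
    # Tear down the cluster once the job finishes.
    delete_cluster_op(
        project_id=project_id, region=region,
        name=create.output).after(hive)

kfp.compiler.Compiler().compile(hive_pipeline, 'hive_pipeline.tar.gz')
```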
(Diffs for the remaining 7 changed files are not shown here.)
