Improve doc for gcp components. #1049

Merged · 12 commits · Apr 4, 2019
143 changes: 97 additions & 46 deletions components/gcp/bigquery/query/README.md
@@ -1,62 +1,104 @@

# Submitting a query using BigQuery
A Kubeflow Pipeline component to submit a query to the Google Cloud BigQuery service and write the output to a Google Cloud Storage blob.

## Intended Use
The component is intended to export query results from the BigQuery service to Cloud Storage.

## Runtime arguments
Name | Description | Data type | Optional | Default
:--- | :---------- | :-------- | :------- | :------
query | The query used by the BigQuery service to fetch the results. | String | No |
project_id | The project to execute the query job. | GCPProjectID | No |
dataset_id | The ID of the persistent dataset to keep the results of the query. If the dataset does not exist, the operation will create a new one. | String | Yes | ` `
table_id | The ID of the table to keep the results of the query. If absent, the operation will generate a random ID for the table. | String | Yes | ` `
output_gcs_path | The path to the Cloud Storage bucket to store the query output. | GCSPath | Yes | ` `
dataset_location | The location to create the dataset. Defaults to `US`. | String | Yes | `US`
job_config | The full config spec for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. An example is shown after this table. | Dict | Yes | ` `
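
The `job_config` dictionary mirrors the client library's `QueryJobConfig`. As a hedged illustration (not taken from the component itself), one way to produce such a dictionary is to build a `QueryJobConfig` with the `google-cloud-bigquery` library and serialize it with `to_api_repr()`. Whether the component accepts exactly this representation is an assumption; check the component source for the authoritative schema.

```python
from google.cloud import bigquery

# Build a QueryJobConfig and serialize it to its dict (API) representation.
# Assumption: the component's job_config follows this representation; verify
# against the component source before relying on it.
config = bigquery.QueryJobConfig()
config.use_legacy_sql = False                 # use standard SQL
config.write_disposition = 'WRITE_TRUNCATE'   # overwrite the destination table
job_config = config.to_api_repr()
# job_config is now a plain dict, for example:
# {'query': {'useLegacySql': False, 'writeDisposition': 'WRITE_TRUNCATE'}}
```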


## Outputs
Name | Description | Type
:--- | :---------- | :---
output_gcs_path | The path to the Cloud Storage bucket containing the query output in CSV format. | GCSPath

## Cautions and requirements
To use the component, the following requirements must be met:
* The BigQuery API is enabled.
* The component runs with the secret of the [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:

```python
bigquery_query_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))
```

* The Kubeflow user service account is a member of the `roles/bigquery.admin` role of the project.
* The Kubeflow user service account is also a member of the `roles/storage.objectCreator` role of the Cloud Storage output bucket. The example commands after this list show one way to set this up.
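
A minimal sketch of how these requirements could be granted from a notebook cell; the commands are standard `gcloud`/`gsutil` invocations, but the project ID, service-account email, and bucket name are placeholders rather than values from this repository:

```python
# One possible setup, shown for illustration only. Replace the placeholder
# project ID, service-account email, and bucket name with your own values.
PROJECT_ID = 'my-project'                                        # placeholder
SA_EMAIL = 'user-gcp-sa@my-project.iam.gserviceaccount.com'      # placeholder
BUCKET = 'my-output-bucket'                                      # placeholder

# Enable the BigQuery API.
!gcloud services enable bigquery.googleapis.com --project $PROJECT_ID

# Grant the service account the roles listed above.
!gcloud projects add-iam-policy-binding $PROJECT_ID --member serviceAccount:$SA_EMAIL --role roles/bigquery.admin
!gsutil iam ch serviceAccount:$SA_EMAIL:objectCreator gs://$BUCKET
```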

## Detailed Description
The component does the following (sketched below):
1. Creates a persistent dataset and table if they do not exist.
1. Submits the query to the BigQuery service and persists the result to the table.
1. Creates an extraction job that writes the table data to a Cloud Storage bucket in CSV format.
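
For intuition only, these steps roughly correspond to the following direct use of the `google-cloud-bigquery` client library. This is a hand-written sketch, not the component's implementation; the project, dataset, table, and output path are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project='my-project')                   # placeholder project

# 1. Create the persistent dataset (the destination table is created by the query job).
dataset = client.create_dataset('my_dataset', exists_ok=True)     # placeholder dataset
table_ref = dataset.table('my_table')                             # placeholder table

# 2. Run the query and persist the result to the table.
job_config = bigquery.QueryJobConfig()
job_config.destination = table_ref
client.query('SELECT 1 AS x', job_config=job_config).result()

# 3. Extract the table to Cloud Storage as CSV (the default extract format).
client.extract_table(table_ref, 'gs://my-bucket/output.csv').result()  # placeholder path
```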

Here are the steps to use the component in a pipeline:

1. Install the KFP SDK

Install the SDK (uncomment the code below if it has not been installed before):


```python
%%capture

KFP_PACKAGE = 'https://storage.googleapis.com/ml-pipeline/release/0.1.13/kfp.tar.gz'
!pip3 install $KFP_PACKAGE --upgrade
```

Contributor

What is the %%capture command?

Contributor Author

It is used to hide the outputs of the cell. It's usually not very useful to show pip install logs here.

Contributor

Maybe we still need errors to be output. https://ipython.readthedocs.io/en/stable/interactive/magics.html#cellmagic-capture shows a way to not capture stderr; however, when I tried it in my notebook with !pip install, it did not work.

Contributor
@Ark-kun Ark-kun Apr 4, 2019

BTW, pip has the --quiet option.
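
Following the reviewer's suggestion above, an alternative sketch would drop `%%capture` and rely on pip's own `--quiet` flag (same package URL as above):

```python
# Quieter install without capturing the whole cell output.
KFP_PACKAGE = 'https://storage.googleapis.com/ml-pipeline/release/0.1.13/kfp.tar.gz'
!pip3 install $KFP_PACKAGE --upgrade --quiet
```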

2. Load the component using the KFP SDK


```python
import kfp.components as comp

COMPONENT_SPEC_URI = 'https://raw.githubusercontent.com/kubeflow/pipelines/d2f5cc92a46012b9927209e2aaccab70961582dc/components/gcp/bigquery/query/component.yaml'
bigquery_query_op = comp.load_component_from_url(COMPONENT_SPEC_URI)
help(bigquery_query_op)
```

For more information about the component, please check out:
* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/bigquery/_query.py)
* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)
* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb)
* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query)


### Sample

Note: The following sample code works in an IPython notebook or directly in Python code.

In this sample, we submit a query that selects questions from the Stack Overflow public dataset (`bigquery-public-data.stackoverflow`) and write the result to a Cloud Storage bucket. Here is the query:


```python
QUERY = 'SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions` LIMIT 10'
```

#### Set sample parameters


```python
# Required Parameters
PROJECT_ID = '<Please put your project ID here>'
GCS_WORKING_DIR = 'gs://<Please put your GCS path here>' # No ending slash
```


```python
# Optional Parameters
EXPERIMENT_NAME = 'Bigquery - Query'
OUTPUT_PATH = '{}/bigquery/query/questions.csv'.format(GCS_WORKING_DIR)
```

#### Run the component as a single pipeline


```python
@@ -68,38 +110,40 @@ import json
    description='Bigquery query pipeline'
)
def pipeline(
    query=QUERY,
    project_id=PROJECT_ID,
    dataset_id='',
    table_id='',
    output_gcs_path=OUTPUT_PATH,
    dataset_location='US',
    job_config=''
):
    bigquery_query_op(
        query=query,
        project_id=project_id,
        dataset_id=dataset_id,
        table_id=table_id,
        output_gcs_path=output_gcs_path,
        dataset_location=dataset_location,
        job_config=job_config).apply(gcp.use_gcp_secret('user-gcp-sa'))
```

#### Compile the pipeline


```python
pipeline_func = pipeline
pipeline_filename = pipeline_func.__name__ + '.zip'
import kfp.compiler as compiler
compiler.Compiler().compile(pipeline_func, pipeline_filename)
```

#### Submit the pipeline for execution


```python
# Specify pipeline argument values
arguments = {}

# Get or create an experiment and submit a pipeline run
import kfp
@@ -110,3 +154,10 @@ experiment = client.create_experiment(EXPERIMENT_NAME)
run_name = pipeline_func.__name__ + ' run'
run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)
```

#### Inspect the output


```python
!gsutil cat $OUTPUT_PATH
```
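
If you prefer a dataframe view, here is a small sketch, assuming `pandas` and `gcsfs` are available in the notebook environment (they may not be installed by default):

```python
import pandas as pd

# pandas can read gs:// paths directly when gcsfs is installed.
df = pd.read_csv(OUTPUT_PATH)
df.head()
```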
12 changes: 8 additions & 4 deletions components/gcp/bigquery/query/component.yaml
@@ -14,7 +14,8 @@

name: Bigquery - Query
description: |
  A Kubeflow Pipeline component to submit a query to Google Cloud Bigquery
  service and dump outputs to a Google Cloud Storage blob.
inputs:
  - name: query
    description: 'The query used by Bigquery service to fetch the results.'
@@ -33,20 +34,23 @@ inputs:
    default: ''
    type: String
  - name: output_gcs_path
    description: 'The path to the Cloud Storage bucket to store the query output.'
Contributor

To be honest, the word "bucket" is confusing here. You probably meant "the path of a GCS directory" or "the path of a GCS file".

Contributor Author

This is a suggested change from a tech writer. I'd like to follow their suggestion to keep it consistent with the other AIHub docs.

Contributor

OK. We should probably still tell them that, technically, a bucket is my-bucket, while gs://my-bucket/some/dir/ is a GCS directory path, not a bucket.
    default: ''
    type: GCSPath
  - name: dataset_location
    description: 'The location to create the dataset. Defaults to `US`.'
    default: 'US'
    type: String
  - name: job_config
    description: >-
      The full config spec for the query job. See
      [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig)
      for details.
    default: ''
    type: Dict
outputs:
  - name: output_gcs_path
    description: 'The path to the Cloud Storage bucket containing the query output in CSV format.'
    type: GCSPath
implementation:
  container: