
BigQuery: Moves BigQuery tutorial for Dataproc to python-docs-samples #1494


Merged
alixhami merged 4 commits into GoogleCloudPlatform:master from bq-natality-tutorial on May 18, 2018

Conversation

alixhami
Contributor

Adds the BigQuery portion of the Dataproc tutorial to this repo so it can be tested regularly (it was previously hard-coded in the docs).

@alixhami alixhami requested a review from tswast May 18, 2018 02:00
@googlebot googlebot added the cla: yes label (This human has signed the Contributor License Agreement.) May 18, 2018


def run_natality_tutorial():
# [START bigquery_query_natality_tutorial]
Contributor

Question: do we want to own this as a BigQuery sample or should it go under Dataproc?

I'm leaning towards BigQuery, too, since there doesn't appear to be anything Dataproc-specific here.

Contributor Author

I put it as BigQuery and didn't include Dataproc in the name because there's nothing Dataproc-related in this sample (except the docstring, which we could move into the docs instead of the sample) and we could potentially leverage this sample for other purposes.


In the code below, the following actions are taken:
* A new dataset is created "natality_regression."
* A new table "regression_input" is created to hold the inputs for our
Contributor

We don't have to actually create the table, right? The query job should be able to do that for us, so long as the configuration.query.createDisposition BigQuery job property is correct.
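For illustration, a minimal sketch of that suggestion with the Python client (assuming the `natality_regression` dataset already exists); the default `create_disposition` of `CREATE_IF_NEEDED` lets the query job create `regression_input` on its own:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Reference the destination table; no create_table() call and no schema.
dataset_ref = bigquery.DatasetReference(client.project, 'natality_regression')
table_ref = dataset_ref.table('regression_input')

job_config = bigquery.QueryJobConfig()
job_config.destination = table_ref
# CREATE_IF_NEEDED is already the default; set explicitly here only to show
# the property that lets the query job create the table for us.
job_config.create_disposition = bigquery.CreateDisposition.CREATE_IF_NEEDED

query = """
    SELECT weight_pounds, mother_age, father_age, gestation_weeks,
        weight_gain_pounds, apgar_5min
    FROM `bigquery-public-data.samples.natality`
    WHERE weight_pounds IS NOT NULL
"""

query_job = client.query(query, job_config=job_config)
query_job.result()  # Wait for the job; the destination table now exists.
```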


# In the new BigQuery dataset, create a new table.
table_ref = dataset.table('regression_input')
# The table needs a schema before it can be created and accept data.
Contributor

I actually think the query job will run just fine without a schema, since BigQuery knows all the correct types when it runs the query.

SELECT weight_pounds, mother_age, father_age, gestation_weeks,
weight_gain_pounds, apgar_5min
FROM `bigquery-public-data.samples.natality`
WHERE weight_pounds is not null
Contributor

Let's use consistent casing for SQL keywords: AND, IS NOT NULL.
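For reference, the excerpt above with consistent keyword casing would read as follows (only the lines visible in the diff; any remaining predicates would follow the same convention):

```python
query = """
    SELECT weight_pounds, mother_age, father_age, gestation_weeks,
        weight_gain_pounds, apgar_5min
    FROM `bigquery-public-data.samples.natality`
    WHERE weight_pounds IS NOT NULL
"""
```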

job_config.destination = table_ref

# BigQuery can auto-detect the schema based on the source table.
job_config.autodetect = True
Contributor

QueryJobConfig doesn't have an autodetect parameter. This line is not needed.

Sorry if my previous comment was confusing. I mean that when BigQuery runs a query, it knows the types of the various columns in the query results based on the query text, so no explicit schema is needed.
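A small standalone sketch of that point (not part of this PR): run a query with no schema and no autodetect setting, and inspect the schema BigQuery inferred for the result columns.

```python
from google.cloud import bigquery

client = bigquery.Client()

# No QueryJobConfig, no schema, no autodetect setting -- BigQuery infers the
# result column types from the query text itself.
rows = client.query(
    """
    SELECT weight_pounds, mother_age
    FROM `bigquery-public-data.samples.natality`
    WHERE weight_pounds IS NOT NULL
    LIMIT 10
    """
).result()

for field in rows.schema:
    print(field.name, field.field_type)  # e.g. weight_pounds FLOAT, mother_age INTEGER
```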

Contributor Author

Oops, removed the nonsense attribute.

@alixhami alixhami merged commit c6b3914 into GoogleCloudPlatform:master May 18, 2018
@alixhami alixhami deleted the bq-natality-tutorial branch May 18, 2018 19:50
Labels: bigquery, cla: yes