BigQuery: Moves BigQuery tutorial for Dataproc to python-docs-samples #1494
Conversation
def run_natality_tutorial():
    # [START bigquery_query_natality_tutorial]
Question: do we want to own this as a BigQuery sample or should it go under Dataproc?
I'm leaning towards BigQuery, too, since there doesn't appear to be anything Dataproc-specific here.
I put it as BigQuery and didn't include Dataproc in the name because there's nothing Dataproc-related in this sample (except the docstring, which we could move into the docs instead of the sample) and we could potentially leverage this sample for other purposes.
In the code below, the following actions are taken:
* A new dataset is created "natality_regression."
* A new table "regression_input" is created to hold the inputs for our
We don't have to actually create the table, right? The query job should be able to do that for us, as long as the `configuration.query.createDisposition` BigQuery job property is correct.
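To illustrate the point above, here is a minimal sketch of the REST-level job configuration the comment refers to. With `createDisposition` set to `CREATE_IF_NEEDED` (the API default), the query job creates the destination table itself, so no separate table-creation step is needed. The project ID and query text below are placeholders, not values from this PR.

```python
# Sketch of a BigQuery jobs.insert request body (REST API), showing the
# configuration.query.createDisposition property the reviewer mentions.
job_config = {
    "configuration": {
        "query": {
            "query": "SELECT weight_pounds FROM `bigquery-public-data.samples.natality` LIMIT 10",
            "destinationTable": {
                "projectId": "your-project",  # placeholder
                "datasetId": "natality_regression",
                "tableId": "regression_input",
            },
            # CREATE_IF_NEEDED tells the query job to create the
            # destination table if it does not already exist.
            "createDisposition": "CREATE_IF_NEEDED",
            "useLegacySql": False,
        }
    }
}
```

With this disposition, the explicit `tables.insert` (or client-library `create_table`) call in the sample becomes unnecessary.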
# In the new BigQuery dataset, create a new table.
table_ref = dataset.table('regression_input')
# The table needs a schema before it can be created and accept data.
I actually think the query job will run just fine without a schema, since BigQuery knows all the correct types when it runs the query.
SELECT weight_pounds, mother_age, father_age, gestation_weeks,
    weight_gain_pounds, apgar_5min
FROM `bigquery-public-data.samples.natality`
WHERE weight_pounds is not null
Let's use consistent casing for SQL keywords (`AND`, `IS`, `NOT`, `NULL`).
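Applying that nitpick to the snippet above, the query with SQL keywords consistently uppercased would read roughly as follows (column list taken from the diff; formatting is one possible choice):

```python
# The query from the diff, with all SQL keywords uppercased,
# including IS NOT NULL in the WHERE clause.
query = """
    SELECT weight_pounds, mother_age, father_age, gestation_weeks,
        weight_gain_pounds, apgar_5min
    FROM `bigquery-public-data.samples.natality`
    WHERE weight_pounds IS NOT NULL
"""
```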
job_config.destination = table_ref
# BigQuery can auto-detect the schema based on the source table.
job_config.autodetect = True
`QueryJobConfig` doesn't have an `autodetect` parameter. This line is not needed.
Sorry if my previous comment was confusing. I mean that when BigQuery runs a query, it knows the types of the various columns in the query results based on the query text, so no explicit schema is needed.
Oops, removed the nonsense attribute.
Adds the BigQuery portion of the Dataproc tutorial to this repo so it can be regularly tested (previously it was hard-coded in the docs).