[BEAM-7034] Add example snippet to read fromQuery using BQ Storage API. #13083
Should we add a note about the pricing? Does using the BigQuery Storage API to read query results cost BigQuery query pricing + BigQuery Storage API pricing on top?
R: @kennknowles
cc: @vachan-shetty
pipeline
    .apply(
        "Read from BigQuery table",
        BigQueryIO.readTableRows()
I would avoid using readTableRows in an example snippet, both for the storage API and also for the existing export-based model -- this involves a needless conversion from Avro to JSON, where customers should instead be able to consume the Avro GenericRecords directly.
Okay, agreed. What would be the preferred way to continue with this?
1. Finish this PR using TableRows, so that all 3 read examples use the same undesired readTableRows() call.
2. Refactor only this example to use read<T>(SerializableFunction<SchemaAndRecord, T> f) as part of this PR.
3. Refactor all 3 examples to use the preferred read<T>(SerializableFunction<SchemaAndRecord, T> f).

The 3 examples are:
- Reading from a table
- Reading with a query string
- Using the BigQuery Storage API
If you have the cycles, let's do (3). Otherwise, you can go ahead with (1) and I will take care of updating them when you're done.
Then let's merge this, and next week I can refactor all 3 examples.
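For reference, the read(SerializableFunction<SchemaAndRecord, T>) variant discussed in this thread could look roughly like the sketch below. It is not the snippet from this PR: the class name, the helper fieldAsString, and the choice to map missing values to an empty string are assumptions made for illustration, and the pipeline needs the Beam GCP IO module plus valid Google Cloud credentials to actually run.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical class name, used only for this sketch.
final class ReadQueryResultsSketch {

  // Hypothetical helper: reads one field straight from the Avro record,
  // skipping the Avro -> TableRow (JSON) conversion done by readTableRows().
  // Avro strings arrive as Utf8 objects, hence toString(); a NULL field is
  // mapped to "" here (an assumption) so that StringUtf8Coder never sees null.
  static String fieldAsString(GenericRecord record, String fieldName) {
    Object value = record.get(fieldName);
    return value == null ? "" : value.toString();
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> values =
        pipeline.apply(
            "Read from BigQuery query",
            BigQueryIO.read(elem -> fieldAsString(elem.getRecord(), "my_string_field_1"))
                .fromQuery("SELECT 'dummy' AS my_string_field_1")
                .usingStandardSql()
                // DIRECT_READ selects the BigQuery Storage API.
                .withMethod(Method.DIRECT_READ)
                .withCoder(StringUtf8Coder.of()));

    pipeline.run().waitUntilFinish();
  }
}
```

The parse function is kept as a small static helper so it can be unit-tested with a plain Avro GenericRecord, without touching BigQuery.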
Also, re: the question above about pricing: the storage API is free when used to read anonymous tables (e.g. query results). Users pay only when scanning from a named table.
@fpopic - Could you address the open comments?
Let me check my understanding with a small example. Does it mean that for my existing named table
[
{
"mode": "NULLABLE",
"name": "my_string_field_1",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "my_string_field_2",
"type": "STRING"
}
]
Or are you just saying that an anonymous table scan
BigQueryIO
    .read<T>(...)
    .fromQuery("SELECT 'dummy' AS my_string_field_1")
    .usingStandardSql()
    .withMethod(Method.DIRECT_READ)
is free of the Storage API cost for the bytes of
In your examples above:
This would incur only BigQuery storage API charges for the uncompressed size of the
This is a BigQuery query -- it will be executed as a query job, the query results will be written to an anonymous table, and then Beam will use the storage API to read the results from the anonymous table. You'll pay the standard $5/TiB on-demand query cost here (unless you're using a BigQuery reservation), but there won't be any costs associated with the storage API usage in this case because the target is an anonymous table.
I think your last example sums things up correctly.
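The cost model described in this comment can be written down as a small calculation. This is only a sketch under stated assumptions: the class and method names are hypothetical, the $5/TiB on-demand query rate is the figure quoted above, the Storage API rate is left as a caller-supplied parameter because no value is given in this thread, and reservations are ignored.

```java
// Hypothetical helper, not part of Beam or any BigQuery client library.
final class BigQueryReadCostSketch {

  // On-demand query price quoted in this thread, in USD per TiB scanned.
  static final double QUERY_USD_PER_TIB = 5.0;

  // Direct read of a named table: only Storage API charges apply.
  // storageApiUsdPerTib is an assumed input; look up the current rate.
  static double namedTableReadCost(double tibRead, double storageApiUsdPerTib) {
    return tibRead * storageApiUsdPerTib;
  }

  // fromQuery + DIRECT_READ: the query job is billed at the on-demand rate,
  // and reading its anonymous result table via the Storage API adds nothing.
  static double queryReadCost(double tibScannedByQuery) {
    return tibScannedByQuery * QUERY_USD_PER_TIB;
  }

  public static void main(String[] args) {
    // Example inputs are illustrative only.
    System.out.println("Query read, 2 TiB scanned: $" + queryReadCost(2.0));
    System.out.println("Named table read, 2 TiB at an assumed rate of $1.5/TiB: $"
        + namedTableReadCost(2.0, 1.5));
  }
}
```

Note that the two paths are billed on different byte counts: the query path on what the query scans, the named-table path on what the Storage API reads.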
Is this PR still active?
I thought the plan was to merge this PR and then proceed with the update to remove readTableRows. Can we proceed with that plan? cc: @vachan-shetty
If the plan is to merge this, could you:
The PR looks good to me. I'm not a Beam repository owner and can't provide formal approval.
Retest this please
Thanks. Will merge after tests pass.
Retest this please
Looks like there are style (spotless) issues.
Hi @aaltay, is there a way to run the linter (or whatever static check is failing) locally? I am having a hard time figuring out what could be wrong without any log message in CI.
You can run
Jira ticket was resolved but the docs haven't been updated accordingly with a snippet.