Skip to content

[BEAM-7034] Add example snippet to read fromQuery using BQ Storage API. #13083

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jan 28, 2021
Merged

Conversation

fpopic
Copy link
Contributor

@fpopic fpopic commented Oct 13, 2020

Jira ticket was resolved but the docs haven't been updated accordingly with a snippet.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Dataflow Flink Samza Spark Twister2
Go Build Status --- Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status
Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status --- Build Status --- Build Status ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status --- --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@fpopic
Copy link
Contributor Author

fpopic commented Oct 13, 2020

Should we add a note about the pricing?

Does using BigQuery Storage API + fromQuery combines the cost:

= BigQuery query pricing + BigQuery Storage API pricing on top?

@fpopic fpopic marked this pull request as ready for review October 13, 2020 12:18
@fpopic
Copy link
Contributor Author

fpopic commented Oct 21, 2020

R: @kennknowles

@aaltay
Copy link
Member

aaltay commented Nov 12, 2020

R: @kmjung @chamikaramj

@kmjung
Copy link
Contributor

kmjung commented Nov 12, 2020

cc: @vachan-shetty

pipeline
.apply(
"Read from BigQuery table",
BigQueryIO.readTableRows()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid using readTableRows in an example snippet, both for the storage API and also for the existing export-based model -- this involves a needless conversion from Avro to JSON, where customers should instead be able to consume the Avro GenericRecords directly.

Copy link
Contributor Author

@fpopic fpopic Nov 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, agree. What would be prefered way to continue with this then?

  1. Finish this PR with using TableRows to have all 3 read examples using the same undesired readTableRows() call
  2. refactor this example only to use read<T>(SerializableFunction<SchemaAndRecord, T> f) as a part of this PR
  3. refactor all 3 examples using the preferred read<T>(SerializableFunction<SchemaAndRecord, T> f)?
    • Reading from a table
    • Reading with a query string
    • Using the BigQuery Storage API

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have the cycles, let's do (3). Otherwise, you can go ahead with (1) and I will take care of updating them when you're done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then let's merge this, and next week I can refactor all 3 examples.

@kmjung
Copy link
Contributor

kmjung commented Nov 13, 2020

Also, re: the question above about pricing: the storage API is free when used to read anonymous tables (e.g. query results). Users pay only when scanning from a named table.

@aaltay
Copy link
Member

aaltay commented Nov 19, 2020

@fpopic - Could you address the open comments?

@fpopic
Copy link
Contributor Author

fpopic commented Nov 20, 2020

Also, re: the question above about pricing: the storage API is free when used to read anonymous tables (e.g. query results). Users pay only when scanning from a named table.

Let me understand on a small example.

Does it mean that for my existing named table myproject:mydataset.mytable with the following schema:

[
  {
    "mode": "NULLABLE",
    "name": "my_string_field_1",
    "type": "STRING"
    },
  {
    "mode": "NULLABLE",
    "name": "my_string_field_2",
    "type": "STRING"
  }
]
  1. Option A

    BigQueryIO
    .read<T>(...)
    .from("myproject.mydataset.mytable")
    .withSelectedFields("my_string_field_1")
    .withMethod(Method.DIRECT_READ))

    Would only include the BigQuery Query scan cost of the field my_string_field_1 + Storage API scan cost for the field my_string_field_1?

  2. Option B

    BigQueryIO
    .read<T>(...)
    .fromQuery("SELECT my_string_field_1 || 'my_concat_business_logic_for_this_field' FROM `myproject.mydataset.mytable`")
    .usingStandardSql()
    .withMethod(Method.DIRECT_READ))

    And here the cost would only include the BigQuery Query scan cost for the field my_string_field_1?

Or you are just saying that anonymous table scan

BigQueryIO
.read<T>(...)
.fromQuery("SELECT 'dummy' AS my_string_field_1")
.usingStandardSql()
.withMethod(Method.DIRECT_READ))

is free of the Storage API cost for the bytes of dummy bytes?

@kmjung
Copy link
Contributor

kmjung commented Nov 20, 2020

In your examples above:

BigQueryIO
    .read<T>(...)
    .from("myproject.mydataset.mytable")
    .withSelectedFields("my_string_field_1")
    .withMethod(Method.DIRECT_READ))

This would incur only BigQuery storage API charges for the uncompressed size of the my_string_field_1 column (e.g. at $1.10/TiB). The BigQuery query engine isn't involved here, and so neither is the $5/TiB query cost.

BigQueryIO
    .read<T>(...)
    .fromQuery("SELECT my_string_field_1 || 'my_concat_business_logic_for_this_field' FROM `myproject.mydataset.mytable`")
    .usingStandardSql()
    .withMethod(Method.DIRECT_READ))

This is a BigQuery query -- it will be executed as a query job, the query results will be written to an anonymous table, and then Beam will use the storage API to read the results from the anonymous table. You'll pay the standard $5/TiB on-demand query cost here (unless you're using a BigQuery reservation), but there won't be any costs associated with the storage API usage in this case because the target is an anonymous table.

I think your last example sums things up correctly.

@aaltay
Copy link
Member

aaltay commented Jan 14, 2021

Is this PR still active?

@fpopic
Copy link
Contributor Author

fpopic commented Jan 15, 2021

Is this PR still active?

@kmjung we stopped here

@kmjung
Copy link
Contributor

kmjung commented Jan 15, 2021

I thought the plan was to merge this PR and then proceed with the update to remove readTableRows. Can we proceed with that plan? cc: @vachan-shetty

@aaltay
Copy link
Member

aaltay commented Jan 15, 2021

I thought the plan was to merge this PR and then proceed with the update to remove readTableRows. Can we proceed with that plan? cc: @vachan-shetty

If the plan is to merge this, could you:

  • Review and LGTM the PR
  • Check that all tests are passing.

@kmjung
Copy link
Contributor

kmjung commented Jan 15, 2021

The PR looks good to me. I'm not a Beam repository owner and can't provide formal approval.

@chamikaramj
Copy link
Contributor

Retest this please

@chamikaramj
Copy link
Contributor

Thanks. Will merge after tests pass.

@chamikaramj
Copy link
Contributor

Retest this please

@aaltay
Copy link
Member

aaltay commented Jan 20, 2021

Looks like there are style (spotless) issues.

@fpopic
Copy link
Contributor Author

fpopic commented Jan 23, 2021

Looks like there are style (spotless) issues.

Hi @aaltay, is there a way to locally run linter or whatever static check is failing, I am having a hard time figuring out what could be wrong without any log message in CI?

@kennknowles
Copy link
Member

You can run ./gradlew spotlessApply and it will automatically fix it.

@chamikaramj chamikaramj merged commit f24ebd3 into apache:master Jan 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants