
BigQueryIO uniformize direct and export reads #32360

Status: Draft · wants to merge 2 commits into base: master
Conversation

@RustedBones (Contributor) commented Aug 29, 2024

Refers to #26329; also fixes #20100 and #21076.

When using readWithDatumReader with the DIRECT_READ method, the transform would fail because a parseFn is expected. This PR refactors the IO so the Avro DatumReader can be used in both cases.

In some cases it is required to read the data with a desired schema. Currently, BigQueryIO always uses the writer schema (or table schema). This PR adds new APIs to set the reader schema; see the sketch below.
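For context, a minimal sketch of the combination that used to fail, built only from existing Beam APIs; the table name and reader schema are made up. Forcing a reader schema through the DatumReaderFactory illustrates the use case the new reader-schema APIs are meant to cover:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.avro.coders.AvroCoder;
import org.apache.beam.sdk.extensions.avro.io.AvroSource;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead;

public class DirectReadWithDatumReader {
  // Reader schema: a compatible projection of the table's writer schema,
  // kept as a JSON string so the serializable lambda below captures no
  // non-serializable state.
  private static final String READER_SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":"
          + "[{\"name\":\"name\",\"type\":\"string\"}]}";

  public static void main(String[] args) {
    Schema readerSchema = new Schema.Parser().parse(READER_SCHEMA_JSON);
    Pipeline p = Pipeline.create();

    p.apply(
        BigQueryIO.readWithDatumReader(
                (AvroSource.DatumReaderFactory<GenericRecord>)
                    (writer, reader) ->
                        new GenericDatumReader<>(
                            writer, new Schema.Parser().parse(READER_SCHEMA_JSON)))
            .from("my-project:my_dataset.my_table") // hypothetical table
            // This combination failed before this PR, because a parseFn was expected:
            .withMethod(TypedRead.Method.DIRECT_READ)
            .withCoder(AvroCoder.of(readerSchema)));

    p.run().waitUntilFinish();
  }
}
```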

This refactoring contains some breaking changes:

withFormat is not exposed anymore. It is not possible to configure a TypedRead with a DatumReaderFactory and change the format afterwards, so the data format MUST be chosen when the transform is created.

In TypedRead.Builder, the DatumReaderFactory is replaced with a BigQueryReaderFactory that handles both Avro and Arrow in a uniform fashion. This alters BigQueryIOTranslation; I need some help to handle that point in a better way.

github-actions bot:

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@RustedBones (Author):

assign set of reviewers

github-actions bot:

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java.
R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@RustedBones (Author):

Some BQ integration tests are failing.

I don't know the schema and data of the big_query_import_export.parallel_read_table_row_xxx tables, so I cannot recreate the setup in a personal GCP project. Can someone give me a hand?

@RustedBones RustedBones marked this pull request as draft September 3, 2024 07:18
@github-actions github-actions bot added examples and removed examples labels Sep 3, 2024
@RustedBones RustedBones marked this pull request as ready for review September 3, 2024 19:48
@github-actions github-actions bot added examples and removed examples labels Sep 4, 2024
```java
// read table schema and infer coder if possible
Coder<T> c;
if (getCoder() == null) {
  tableSchema = requestTableSchema(sourceDef, bqOptions, getSelectedFields());
```
@RustedBones (Author):

Is it fine to access the BQ table at graph creation time? (It was already doing that when the Beam schema was requested.)

Contributor:

Yeah, this is a valid concern. I've heard of use cases where the pipeline submission machine has no or incomplete permissions on the resource, and inferring the schema at graph creation time can cause issues. The general guideline is that use cases that used to work should still work (and vice versa).

```diff
@@ -1731,7 +1870,7 @@ public void processElement(ProcessContext c) throws Exception {
         .setTable(
             BigQueryHelpers.toTableResourceName(
                 queryResultTable.getTableReference()))
         .setDataFormat(DataFormat.AVRO))
```
@RustedBones (Author):

Was Arrow even supported?

Comment on lines 40 to 43

```java
// TODO make file source params generic (useAvroLogicalTypes)
abstract BoundedSource<T> getSource(
    MatchResult.Metadata metadata,
    TableSchema tableSchema,
    Boolean useAvroLogicalTypes,
```
@RustedBones (Author):

Any proposal here? If there is a plan to support CSV export, for instance, we'd have to pass the chosen field_delimiter. One possible shape is sketched below.
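A sketch with hypothetical names: bundle the format-specific options in a small value class instead of widening the getSource signature for every new export format.

```java
/**
 * Hypothetical value class bundling format-specific file-source parameters,
 * so that getSource(...) does not grow one extra argument per export format.
 */
final class FileSourceParams {
  private final boolean useAvroLogicalTypes; // Avro export
  private final String fieldDelimiter; // CSV export; null for other formats

  FileSourceParams(boolean useAvroLogicalTypes, String fieldDelimiter) {
    this.useAvroLogicalTypes = useAvroLogicalTypes;
    this.fieldDelimiter = fieldDelimiter;
  }

  boolean useAvroLogicalTypes() {
    return useAvroLogicalTypes;
  }

  String fieldDelimiter() {
    return fieldDelimiter;
  }
}
```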

Contributor:

> If there is a plan to support CSV export, for instance

No major change to EXPORT mode is planned, AFAIK. Current efforts are focused on improving the DIRECT_READ mode.

```java
 * @throws BigQuerySchemaRetrievalException if schema retrieval fails
 */
Schema getBeamSchema(BigQueryOptions bqOptions);
TableSchema getTableSchema(BigQueryOptions bqOptions);
```
@RustedBones (Author):

As this is in the BQ realm, it makes more sense to return the unaltered TableSchema.

Contributor:

There was a proposal to make the Beam Schema uniform in future Beam versions (3.x); unfortunately, BigQueryIO is the biggest IO that does not follow this (and has its own TableSchema, which does not implement Serializable). I would suggest keeping getBeamSchema; we can have getTableSchema as an addition.
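Concretely, that would leave the source definition with both accessors, roughly as follows (a sketch only; BigQuerySourceDef and getBeamSchema exist today, getTableSchema is the addition from this PR):

```java
import com.google.api.services.bigquery.model.TableSchema;
import java.io.Serializable;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
import org.apache.beam.sdk.schemas.Schema;

interface BigQuerySourceDef extends Serializable {
  /** Beam schema, kept for uniformity with other IOs. */
  Schema getBeamSchema(BigQueryOptions bqOptions);

  /** Unaltered BigQuery TableSchema, for use within the BQ realm. */
  TableSchema getTableSchema(BigQueryOptions bqOptions);
}
```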

@RustedBones (Author):

Many changes here:

  • Generating the Avro schema from the table schema has two options:

    • with logical types, as done in BQ direct read and in BQ export with use_avro_logical_types;
    • without logical types, as done in plain BQ export. This conversion is destructive, as many types fall back to String (sketched below).
  • Converting GenericRecord to TableRow changed. It now expects the logical-type schema and thus can drop the need for the TableSchema during the conversion.
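For illustration, the two options for a BigQuery DATE field, expressed with plain Avro APIs (the conversion entry points in the PR itself are not shown here):

```java
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class DateFieldConversion {
  // With logical types (direct read, or export with use_avro_logical_types):
  // DATE maps to an int carrying the "date" logical type.
  static Schema dateWithLogicalType() {
    return LogicalTypes.date().addToSchema(Schema.create(Schema.Type.INT));
  }

  // Without logical types (plain export): DATE falls back to a string,
  // which is destructive since the date semantics are lost.
  static Schema dateWithoutLogicalType() {
    return Schema.create(Schema.Type.STRING);
  }
}
```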

Comment on lines 329 to 292
```java
byte[] formatBytes = configRow.getBytes("format");
if (methodBytes != null) {
```
@RustedBones (Author):

Small bug here: formatBytes is read, but methodBytes is checked. A fix could look like this:
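```java
byte[] formatBytes = configRow.getBytes("format");
if (formatBytes != null) { // was mistakenly checking methodBytes
  // ... decode and restore the data format from formatBytes
}
```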

```diff
@@ -89,8 +95,8 @@ static class BigQueryIOReadTranslator implements TransformPayloadTranslator<Type
         .addNullableBooleanField("use_legacy_sql")
         .addNullableBooleanField("with_template_compatibility")
         .addNullableByteArrayField("bigquery_services")
         .addNullableByteArrayField("parse_fn")
         .addNullableByteArrayField("datum_reader_factory")
         .addNullableByteArrayField("bigquery_reader_factory")
```
@RustedBones (Author):

This is a complex object to serialize; it is subject to serialization errors if anything changes between versions.

.put("TIMESTAMP", Type.LONG)
.put("RECORD", Type.RECORD)
.put("STRUCT", Type.RECORD)
.put("DATE", Type.STRING)
@RustedBones (Author):

This was probably the cause of #20677: the schema conversion was only taking the first occurrence. When writing a date, we want an int with the date logical type; we only want the string representation when reading an exported table without logical types enabled.

github-actions bot:

Reminder, please take a look at this PR: @robertwb @ahmedabu98

github-actions bot:

Reminder, please take a look at this PR: @damondouglas @Abacn


github-actions bot commented Oct 2, 2024

Assigning a new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)


github-actions bot commented Oct 9, 2024

Reminder, please take a look at this PR: @kennknowles @ahmedabu98

github-actions bot:

Assigning a new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java.
R: @johnjcasey for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@ahmedabu98 (Contributor):

waiting on author

Commits:

  • It should be possible to read BQ avro data using a provided compatible avro schema, for both file and direct read
  • Add readRows api
  • Improve coder inference
  • Self review
  • Fix concurrency issue
  • spotless
  • checkstyle
  • Ignore BigQueryIOTranslationTest
  • Add missing project option to execute test
  • Call table schema only if required
  • Fix avro export without logical type
  • checkstyle
  • Add back float support
  • Fix write test
  • Add arrow support in translation

Successfully merging this pull request may close these issues.

BigQueryIO TableRowParser should support Arrow and Avro data formats