@dackers86 dackers86 commented Mar 16, 2022

The BigQuery extension currently stores document data as a stringified JSON object when syncing from Firestore. This PR prepares for when a native JSON type becomes available.

A new JSON data type is currently in preview and can be tracked here: https://cloud.google.com/bigquery/docs/reference/standard-sql/json-data

  • Update the default schema to use a JSON datatype
  • Ensure new tables are generated with the new datatype
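
As a rough sketch of the first bullet, the change amounts to switching the declared type of the data column from STRING to JSON. The field shape below follows BigQuery's table schema format (name, mode, type, description); the constant names and description text are illustrative assumptions, not copied from the extension's source.

```typescript
// Minimal shape of a BigQuery schema field (a subset of the table schema format).
interface SchemaField {
  name: string;
  mode: "NULLABLE" | "REQUIRED" | "REPEATED";
  type: string;
  description?: string;
}

// Before: the document payload is stored as a stringified JSON object.
// (Hypothetical constant name and description, for illustration only.)
const dataFieldAsString: SchemaField = {
  name: "data",
  mode: "NULLABLE",
  type: "STRING",
  description: "Full JSON representation of the document, stored as a string.",
};

// After: the same column, declared with BigQuery's native JSON type.
const dataFieldAsJson: SchemaField = {
  ...dataFieldAsString,
  type: "JSON",
  description: "Full JSON representation of the document, stored as native JSON.",
};
```

Only newly created tables would pick up the new type this way; existing tables keep whatever type the column was created with.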

fixes: #1775

@dackers86 dackers86 added the do not merge Do not merge this Pull Request label Mar 16, 2022
@dackers86 dackers86 force-pushed the @invertase/add-bq-json-schema-type branch from 5502347 to 503ba1c Compare March 16, 2022 10:20
@dackers86 dackers86 requested a review from a team as a code owner March 29, 2022 09:08
@IchordeDionysos
Contributor

@dackers86 when is it planned to merge this PR? Sounds like a helpful addition 😍

@IchordeDionysos IchordeDionysos left a comment

Nice work, but there are some missing pieces. In particular, the latest view won't work anymore when using the JSON data type.

fields: [
...defaultViewSchemaFields,
{
name: "data",

The field path_params should also be made a JSON object, depending on the user's selection.
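
One way this could look, as a minimal sketch: pick the column type for path_params from a user-facing setting. The flag name useJsonType, the function name, and the description text are all hypothetical; the PR does not define how the selection is exposed.

```typescript
type ColumnType = "STRING" | "JSON";

interface PathParamsField {
  name: "path_params";
  mode: "NULLABLE";
  type: ColumnType;
  description: string;
}

// Hypothetical helper: `useJsonType` stands in for whatever setting
// the user selects; it is not an existing extension parameter.
function buildPathParamsField(useJsonType: boolean): PathParamsField {
  return {
    name: "path_params",
    mode: "NULLABLE",
    // Keep STRING as the fallback so existing behaviour is unchanged.
    type: useJsonType ? "JSON" : "STRING",
    description: "Wildcard parameters extracted from the document path.",
  };
}
```

Deriving both data and path_params from the same setting would keep the two columns consistent within a table.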

schema.fields.push(documentPathParams);
}

const latestSnapshot = latestConsistentSnapshotView(

The latest BigQuery view also needs to be adjusted, as it's not possible to group by a JSON column.

This is what comes up when you have data as a JSON field, with the auto-created collection_raw_latest view.
image

@dackers86
Member Author

Thanks, IchordeDionysos. Now that the JSON data type is out of its original preview status, we can resume development on this. The next steps will be to address the issues you have pointed out with the views and path_params.

@nfarina
nfarina commented Nov 9, 2022

I was able to make this work by using SELECT …, ARRAY_AGG(data)[offset(0)] as data, instead of attempting to GROUP BY data since grouping by native JSON type is still unsupported.

However, this removes most of the advantages of the JSON type! When querying the resulting view, the entire data column is loaded (and billed) instead of a subset. For instance, SELECT data.age FROM latest will read the same number of bytes as SELECT data FROM latest.

This is as opposed to SELECT data.age FROM changelog which only bills for the age column (since native JSON types are spread out into native subcolumns at ingestion time).

I don't know enough about BigQuery to suggest an alternative strategy, but I really hope there is a way to have our cake and eat it too here.
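
For readers following along, the workaround described in this comment can be sketched as a small query builder. This is only an illustration: the function name is hypothetical, and the caller supplies the table and the grouping columns, since the comment elides them. Note that ARRAY_AGG without an ORDER BY returns its elements in a non-deterministic order, so taking OFFSET(0) is only well defined when every row in a group already carries the same JSON value.

```typescript
// Hypothetical helper: builds a SELECT that aggregates the JSON column
// instead of listing it in GROUP BY (grouping by JSON is unsupported).
function buildAggregatedSelect(
  table: string,
  groupColumns: string[],
  jsonColumn: string
): string {
  const groupList = groupColumns.join(", ");
  return [
    `SELECT ${groupList},`,
    // Collapse each group's JSON values to a single element.
    `  ARRAY_AGG(${jsonColumn})[OFFSET(0)] AS ${jsonColumn}`,
    `FROM \`${table}\``,
    // The JSON column is deliberately absent from GROUP BY.
    `GROUP BY ${groupList}`,
  ].join("\n");
}
```

Calling buildAggregatedSelect with a table name, a list of non-JSON columns, and "data" yields a statement whose GROUP BY clause contains only the non-JSON columns.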

@dackers86
Member Author

Moving this back to under consideration for planning. This PR was an initial start but we need to consider and plan how this fits into the overall BQ schema and development.

@basvandorst

We manually changed the data type of the changelog table to JSON and did indeed have to deal with the same GROUP BY issue.

We fixed it in this way:

WITH RankedData AS (
  SELECT
    document_name,
    document_id,
    timestamp,
    event_id,
    operation,
    data,
    path_params,
    ROW_NUMBER() OVER(PARTITION BY document_name ORDER BY timestamp DESC) AS row_num
  FROM `collection`
)
SELECT
  document_name,
  document_id,
  timestamp,
  event_id,
  operation,
  data,
  path_params
FROM RankedData
WHERE row_num = 1 

I don't know enough about the inner workings or performance impact of BQ, but for us it works fine!
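
To make the intent of the query above concrete, here is an in-memory sketch of the same selection: keep, for each document_name, the row with the greatest timestamp. The row type and function name are hypothetical, and timestamps are plain numbers standing in for BigQuery TIMESTAMP values. On ties this sketch keeps the first row seen, whereas ROW_NUMBER() makes no promise about which tied row is ranked first.

```typescript
// Hypothetical, simplified changelog row; not the extension's actual type.
interface ChangelogRow {
  document_name: string;
  timestamp: number; // epoch millis, standing in for a BigQuery TIMESTAMP
  operation: string;
  data: unknown;
}

// Equivalent in spirit to:
//   ROW_NUMBER() OVER (PARTITION BY document_name ORDER BY timestamp DESC) = 1
function latestPerDocument(rows: ChangelogRow[]): ChangelogRow[] {
  const latest = new Map<string, ChangelogRow>();
  for (const row of rows) {
    const current = latest.get(row.document_name);
    // Replace only when this row is strictly newer than the stored one.
    if (current === undefined || row.timestamp > current.timestamp) {
      latest.set(row.document_name, row);
    }
  }
  return Array.from(latest.values());
}
```

Because the data column is only carried along and never compared, this approach does not need the JSON type to support grouping or equality.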

@cabljac cabljac force-pushed the next branch 3 times, most recently from 1fad407 to 27370ce Compare March 21, 2024 09:01
@cabljac cabljac force-pushed the next branch 3 times, most recently from c8885bd to d1481a2 Compare April 24, 2024 10:27
@tiagosilveiradev

Any updates on this? It would be very useful.

Successfully merging this pull request may close these issues.

[firestore-bigquery-export] change datatype in changelog table from STRING to JSON