
BigQuery: get table schema if not supplied (and have pyarrow) in load_table_from_dataframe #8142

Closed · tswast opened this issue May 24, 2019 · 7 comments · Fixed by #9108
Labels: api: bigquery, type: feature request

tswast (Contributor) commented May 24, 2019

Follow-up to #8105 (comment)

When a table schema isn't supplied in load_table_from_dataframe, try to get the existing table schema. This will prevent errors due to ambiguous pandas types (#7370) without having to explicitly provide a schema.

Note: this behavior is similar to that of pandas-gbq, which always fetches the table schema and then compares it with the DataFrame schema to make sure they're compatible.

https://github.com/pydata/pandas-gbq/blob/59228d9c20cee12b24caa5cc41d3f2e6c0337932/pandas_gbq/gbq.py#L1115-L1121
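
For reference, here is a minimal sketch of the requested fallback (the table ID and the DataFrame contents are illustrative, not from the issue):

import pandas
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"  # hypothetical destination table
dataframe = pandas.DataFrame({"name": ["Alice"], "age": [30]})

job_config = bigquery.LoadJobConfig()
if not job_config.schema:
    # Proposed fallback: reuse the destination table's schema so that
    # ambiguous pandas dtypes (e.g. object columns) are serialized with
    # the correct BigQuery types.
    job_config.schema = client.get_table(table_id).schema

client.load_table_from_dataframe(dataframe, table_id, job_config=job_config).result()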

timocb commented Aug 7, 2019

Hi @tswast! I was looking into how to solve this issue, since #8105 closed my issue #8093. It would be great if we could do this in the background.

Would this be as simple as adding the following code here https://github.com/googleapis/google-cloud-python/blob/master/bigquery/google/cloud/bigquery/client.py#L1523?

This code would get the schema of the destination table and apply it to the job_config if no schema info is available.

if not job_config.schema:
    client = Client()
    job_config.schema = client.get_table(destination).schema

Note: I am not sure if you want to initialize the client here.

tswast (Contributor, Author) commented Aug 7, 2019

> Note: I am not sure if you want to initialize the client here.

A client object is already available as self.

> job_config.schema = client.get_table(destination).schema

This is necessary, but not sufficient. There are several cases to handle.

  • A: DataFrame has same set of columns as existing table. This is the easy case, as it's what google.cloud.bigquery._pandas_helpers.dataframe_to_parquet already expects.
  • B: DataFrame has new columns that aren't present in the existing table.
  • C: DataFrame is missing columns that are present in the existing table.

Note: B & C can both be true if there are some new columns and some missing columns in the DataFrame.

Also, for those users who are using fastparquet instead of pyarrow, I don't want to force them to switch to pyarrow, as that's a breaking change (though given the difference in behavior, we may want to consider dropping fastparquet as a supported serialization library).

timocb commented Aug 8, 2019

@tswast

> A client object is already available as self.

Oh, of course :P

> There are several cases to handle.

How are these cases (B and C) expected to be handled?

> Also, for those users who are using fastparquet instead of pyarrow, I don't want to force them to switch to pyarrow, as that's a breaking change (though given the difference in behavior, we may want to consider dropping fastparquet as a supported serialization library).

I am not sure if I understand how you would force users to use pyarrow. Can you elaborate?

tswast (Contributor, Author) commented Aug 21, 2019

> B: DataFrame has new columns that aren't present in the existing table.
> C: DataFrame is missing columns that are present in the existing table.

#9064 actually handles both these cases, as it filters the schema by column name and re-orders the schema to match the DataFrame column order.
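
For illustration, that filtering/re-ordering step could look roughly like this (the helper name and details here are mine, not the actual #9064 code):

def schema_for_dataframe(table_schema, dataframe):
    # Keep only the fields whose names appear in the DataFrame, in the
    # DataFrame's column order. Case B columns (new in the DataFrame) are
    # simply absent from the mapping; case C columns (missing from the
    # DataFrame) are dropped by the filter.
    fields_by_name = {field.name: field for field in table_schema}
    return [
        fields_by_name[column]
        for column in dataframe.columns
        if column in fields_by_name
    ]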

tswast (Contributor, Author) commented Aug 24, 2019

FYI: #9096 will affect this implementation. After getting the table schema, you'll have to filter out any columns not present in the DataFrame.

If there are any columns in the DataFrame that aren't present in the table, we have two options:

  1. Fail. There's a schema mismatch.
  2. Attempt to update the table schema to add the new columns before uploading. This is a bit treacherous, especially if the column dtype is object.

I believe option 2 is what pandas-gbq does, but option 1 is current behavior. If we do want to pursue option 2, then we should file it as a separate feature request.
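
A rough sketch of the option 1 check (names are illustrative, not from any PR):

def check_no_new_columns(table_schema, dataframe):
    # Option 1: fail loudly if the DataFrame has columns the destination
    # table lacks, rather than trying to update the table schema.
    table_columns = {field.name for field in table_schema}
    new_columns = [name for name in dataframe.columns if name not in table_columns]
    if new_columns:
        raise ValueError(
            "DataFrame columns not present in the destination table: "
            + ", ".join(new_columns)
        )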

tswast (Contributor, Author) commented Aug 27, 2019

As I'm writing some samples for this, I'm realizing we probably don't want to fetch the schema if the write disposition is WRITE_TRUNCATE because the previous table schema doesn't matter in that case.
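
Sketched out, that guard might look like the following (a hypothetical helper, not the actual implementation):

from google.cloud import bigquery

def maybe_apply_table_schema(client, destination, job_config):
    # Skip the fetch when the load overwrites the table anyway; the
    # previous schema is irrelevant for WRITE_TRUNCATE.
    if job_config.write_disposition == bigquery.WriteDisposition.WRITE_TRUNCATE:
        return
    if not job_config.schema:
        # A fuller version would also tolerate a not-yet-existing
        # destination table (google.api_core.exceptions.NotFound).
        job_config.schema = client.get_table(destination).schema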

plamut (Contributor) commented Aug 27, 2019

@tswast Sounds like something to update in the PR?
