
BigQuery: get table schema if not supplied (and have pyarrow) in load_table_from_dataframe #8142

Closed · tswast opened this issue May 24, 2019 · 7 comments · Fixed by #9108
Labels: api: bigquery, type: feature request

tswast (Contributor) commented May 24, 2019

Follow-up to #8105 (comment)

When a table schema isn't supplied in load_table_from_dataframe, try to get the existing table schema. This will prevent errors due to ambiguous pandas types (#7370) without having to explicitly provide a schema.

Note: this behavior is similar to that of pandas-gbq, which always fetches the table schema and then compares it with the DataFrame schema to make sure they're compatible.

https://github.com/pydata/pandas-gbq/blob/59228d9c20cee12b24caa5cc41d3f2e6c0337932/pandas_gbq/gbq.py#L1115-L1121
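
For reference, here is a minimal sketch of the requested fallback (the table ID and the DataFrame contents are illustrative, not from the issue):

import pandas
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"  # hypothetical destination table
dataframe = pandas.DataFrame({"name": ["Alice"], "age": [30]})

job_config = bigquery.LoadJobConfig()
if not job_config.schema:
    # Proposed fallback: reuse the destination table's schema so that
    # ambiguous pandas dtypes (e.g. object columns) are serialized with
    # the correct BigQuery types.
    job_config.schema = client.get_table(table_id).schema

client.load_table_from_dataframe(dataframe, table_id, job_config=job_config).result()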

timocb commented Aug 7, 2019

Hi @tswast! I was looking into how to solve this issue, since #8105 closed my issue #8093. It would be great if we could do this in the background.

Would this be as simple as adding the following code here https://github.com/googleapis/google-cloud-python/blob/master/bigquery/google/cloud/bigquery/client.py#L1523?

This code would get the schema of the destination table and apply it to the job_config if no schema info is available.

if not job_config.schema:
    client = Client()
    job_config.schema = client.get_table(destination).schema

Note: I am not sure if you want to initialize the client here.

tswast (Contributor, Author) commented Aug 7, 2019

> Note: I am not sure if you want to initialize the client here.

A client object is already available as self.

> job_config.schema = client.get_table(destination).schema

This is necessary, but not sufficient. There are several cases to handle.

  • A: DataFrame has same set of columns as existing table. This is the easy case, as it's what google.cloud.bigquery._pandas_helpers.dataframe_to_parquet already expects.
  • B: DataFrame has new columns that aren't present in the existing table.
  • C: DataFrame is missing columns that are present in the existing table.

Note: B & C can both be true if there are some new columns and some missing columns in the DataFrame.

Also, for those users who are using fastparquet instead of pyarrow, I don't want to force them to switch to pyarrow, as that's a breaking change (though given the difference in behavior, we may want to consider dropping fastparquet as a supported serialization library).

timocb commented Aug 8, 2019

@tswast

> A client object is already available as self.

Oh, of course :P

> There are several cases to handle.

How are these cases (B and C) expected to be handled?

> Also, for those users who are using fastparquet instead of pyarrow, I don't want to force them to switch to pyarrow, as that's a breaking change (though given the difference in behavior, we may want to consider dropping fastparquet as a supported serialization library).

I am not sure if I understand how you would force users to use pyarrow. Can you elaborate?

tswast (Contributor, Author) commented Aug 21, 2019

> B: DataFrame has new columns that aren't present in the existing table.
> C: DataFrame is missing columns that are present in the existing table.

#9064 actually handles both these cases, as it filters the schema by column name and re-orders the schema to match the DataFrame column order.
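
For illustration, that filtering/re-ordering step could look roughly like this (the helper name and details here are mine, not the actual #9064 code):

def schema_for_dataframe(table_schema, dataframe):
    # Keep only the fields whose names appear in the DataFrame, in the
    # DataFrame's column order. Case B columns (new in the DataFrame) are
    # simply absent from the mapping; case C columns (missing from the
    # DataFrame) are dropped by the filter.
    fields_by_name = {field.name: field for field in table_schema}
    return [
        fields_by_name[column]
        for column in dataframe.columns
        if column in fields_by_name
    ]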

tswast (Contributor, Author) commented Aug 24, 2019

FYI: #9096 will affect this implementation. After getting the table schema, you'll have to filter out any columns not present in the DataFrame.

If there are any columns in the DataFrame that aren't present in the table, we have two options:

  1. Fail. There's a schema mismatch.
  2. Attempt to update the table schema to add the new columns before uploading. This is a bit treacherous, especially if the column dtype is object.

I believe option 2 is what pandas-gbq does, but option 1 is current behavior. If we do want to pursue option 2, then we should file it as a separate feature request.
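
A rough sketch of the option 1 check (names are illustrative, not from any PR):

def check_no_new_columns(table_schema, dataframe):
    # Option 1: fail loudly if the DataFrame has columns the destination
    # table lacks, rather than trying to update the table schema.
    table_columns = {field.name for field in table_schema}
    new_columns = [name for name in dataframe.columns if name not in table_columns]
    if new_columns:
        raise ValueError(
            "DataFrame columns not present in the destination table: "
            + ", ".join(new_columns)
        )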

tswast (Contributor, Author) commented Aug 27, 2019

As I'm writing some samples for this, I'm realizing we probably don't want to fetch the schema if the write disposition is WRITE_TRUNCATE because the previous table schema doesn't matter in that case.
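
Sketched out, that guard might look like the following (a hypothetical helper, not the actual implementation):

from google.cloud import bigquery

def maybe_apply_table_schema(client, destination, job_config):
    # Skip the fetch when the load overwrites the table anyway; the
    # previous schema is irrelevant for WRITE_TRUNCATE.
    if job_config.write_disposition == bigquery.WriteDisposition.WRITE_TRUNCATE:
        return
    if not job_config.schema:
        # A fuller version would also tolerate a not-yet-existing
        # destination table (google.api_core.exceptions.NotFound).
        job_config.schema = client.get_table(destination).schema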

plamut (Contributor) commented Aug 27, 2019

@tswast Sounds like something to update in the PR?
