[Python][BigQuery][Parquet] BigQuery Interprets Simple List Fields in Parquet Files as RECORD Types with REPEATED Mode
Describe the enhancement requested
I'm not sure whether the behavior described below is expected and I'm just missing something, or whether it is a bug.
When uploading a Parquet file created with PyArrow to Google BigQuery, columns containing simple lists (e.g., List[str], List[int], List[float]) are interpreted by BigQuery as RECORD types with REPEATED mode instead of the expected primitive types (STRING, INTEGER, FLOAT) with REPEATED mode.
The example input schema is:
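(A reconstruction of that schema, assuming the three list columns named later in the report; the original listing may have differed slightly.)

```python
import pyarrow as pa

schema = pa.schema(
    [
        pa.field("int_column", pa.list_(pa.int64())),
        pa.field("str_column", pa.list_(pa.string())),
        pa.field("float_column", pa.list_(pa.float64())),
    ]
)
```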
After uploading to a BigQuery table via a Parquet file, it returns the following schema (after querying and converting back to an Arrow table):
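(Illustrative reconstruction; the exact inner field names reported back may differ, but each simple list column comes back as a list of single-field structs, i.e. a REPEATED RECORD:)

```python
import pyarrow as pa

# Illustrative only: each simple list column round-trips as a list of
# single-field structs (a REPEATED RECORD in BigQuery terms). The inner
# field name ("element") is an assumption and may differ.
returned_schema = pa.schema(
    [
        pa.field("int_column", pa.list_(pa.struct([("element", pa.int64())]))),
        pa.field("str_column", pa.list_(pa.struct([("element", pa.string())]))),
        pa.field("float_column", pa.list_(pa.struct([("element", pa.float64())]))),
    ]
)
```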
I've tried explicitly defining the schema in BigQuery and ensuring that the Parquet file's schema matches, but the behavior persists.
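For reference, a sketch of what that explicit schema definition might look like (the table ID is a placeholder, not taken from the original report):

```python
from google.cloud import bigquery

# Sketch only: REPEATED primitive fields supplied as an explicit table schema.
client = bigquery.Client()
schema = [
    bigquery.SchemaField("int_column", "INTEGER", mode="REPEATED"),
    bigquery.SchemaField("str_column", "STRING", mode="REPEATED"),
    bigquery.SchemaField("float_column", "FLOAT", mode="REPEATED"),
]
table = bigquery.Table("my-project.my_dataset.list_test", schema=schema)
client.create_table(table, exists_ok=True)
```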
I have a workaround in mind (via JSON) but would prefer to keep using PyArrow and Parquet.
Example Code
To reproduce, create a Parquet file using PyArrow that includes columns with lists of integers, strings, and floats. Upload this Parquet file to BigQuery via a bucket and inspect the table schema and field values.
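A minimal reproduction might look like this (the bucket, project, dataset, and table names are placeholders, not taken from the original report):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigquery, storage

# Build a table with simple list columns and write it to Parquet.
table = pa.table(
    {
        "int_column": pa.array([[1, 2, 3], [4]], type=pa.list_(pa.int64())),
        "str_column": pa.array([["a", "b"], ["c"]], type=pa.list_(pa.string())),
        "float_column": pa.array([[1.0, 2.5], [3.5]], type=pa.list_(pa.float64())),
    }
)
pq.write_table(table, "lists.parquet")

# Placeholder names; substitute your own bucket, dataset, and table.
bucket_name = "my-bucket"
table_id = "my-project.my_dataset.list_test"

# Upload the file to a GCS bucket.
storage.Client().bucket(bucket_name).blob("lists.parquet").upload_from_filename("lists.parquet")

# Load it into BigQuery and inspect the resulting schema.
bq = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
load_job = bq.load_table_from_uri(
    f"gs://{bucket_name}/lists.parquet", table_id, job_config=job_config
)
load_job.result()
print(bq.get_table(table_id).schema)
```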
I would expect BigQuery to recognize the int_column, str_column, and float_column as arrays of integers, strings, and floats respectively (with REPEATED mode). However, it interprets these columns as RECORD types with REPEATED mode, which complicates data handling.
Environment:
• Python 3.11.10
• Ubuntu 22.04.5
• pyarrow==18.0.0
• google-cloud-bigquery==3.26.0
• google-cloud-storage==2.18.2
Component(s)
Python