BigQuery: Upload pandas DataFrame containing arrays #19
Thanks for the report. Struct support is reported at https://github.com/googleapis/google-cloud-python/issues/8191, but I'll keep this issue open as a feature request for arrays of scalar types.
I am not able to reproduce this issue with a dataframe containing arrays only; it does not raise any exception. I can see an exception when we use a dict (STRUCT/RECORD) in the dataframe. @AETDDraper Could you please provide a sample dataframe containing arrays which causes the error? google-cloud-bigquery 1.20.0
Loading a dataframe containing an array (a Python list) does work if you don't specify the schema in the job config.
The detected schema looks like this:
There is no way to provide a schema with the ARRAY type though, which is a bit frustrating, nor with the RECORD or STRUCT type. My config:
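(The config itself wasn't captured in the thread. As a rough sketch of the kind of schema being described, with field names assumed rather than taken from the original post: SchemaField has no "ARRAY" type, so an array column is declared as a scalar type with mode="REPEATED".)

```python
from google.cloud import bigquery

# Hypothetical schema: an array column is expressed as a scalar type
# with mode="REPEATED", since SchemaField does not accept "ARRAY" directly.
schema = [
    bigquery.SchemaField("id", "INTEGER"),
    bigquery.SchemaField("values", "FLOAT", mode="REPEATED"),
]
job_config = bigquery.LoadJobConfig(schema=schema)
```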
FWIW, uploading STRUCT columns through a dataframe will become possible once a fix in the underlying library is released. At least uploading REPEATED fields (even those consisting of STRUCTs) works with load_table_from_json:

schema = [
    bigquery.SchemaField(
        "bar",
        "STRUCT",
        fields=[
            bigquery.SchemaField("aaa", "INTEGER", mode="REQUIRED"),
            bigquery.SchemaField("bbb", "INTEGER", mode="REQUIRED"),
        ],
        mode="REPEATED",
    ),
]

json_data = [
    {"bar": [{"aaa": 1, "bbb": 2}, {"aaa": 10, "bbb": 20}]},
    {"bar": [{"aaa": 3, "bbb": 4}, {"aaa": 5, "bbb": 6}]},
]

job_config = bigquery.LoadJobConfig(schema=schema)
client.load_table_from_json(json_data, table_ref, job_config=job_config).result()
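As a quick follow-up check (a sketch, reusing the client and table_ref assumed from the snippet above), the loaded repeated STRUCT values can be read back to confirm the shape:

```python
# Each value of "bar" comes back as a list of {"aaa": ..., "bbb": ...} dicts.
for row in client.list_rows(table_ref, max_results=2):
    print(row["bar"])
```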
As far as I can tell, the use case of loading a dataframe that contains an array (no struct) doesn't work even with the new version of Pyarrow (0.17.1). The dataframe looks like this:
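(The dataframe itself wasn't captured; a minimal sketch of data matching the schema below, with assumed values, might look like this.)

```python
import pandas as pd

# Assumed example data matching the schema that follows: an INTEGER column
# and a REPEATED FLOAT column, where each cell of field_B holds a Python list.
data = pd.DataFrame(
    {
        "field_A": [1, 2],
        "field_B": [[1.0, 2.0], [3.0, 4.0]],
    }
)
```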
Loading the dataframe:

schema = [
    bigquery.SchemaField("field_A", "INTEGER"),
    bigquery.SchemaField("field_B", "FLOAT", "REPEATED"),
]
table = bigquery.Table(table_id, schema=schema)
bq_client.create_table(table, exists_ok=True)

job_config = bigquery.LoadJobConfig(
    schema=schema,
    destination=table_id,
    write_disposition="WRITE_TRUNCATE",
)
job = bq_client.load_table_from_dataframe(data, table_id, job_config=job_config)
job.result()

Produces:
@aaaaahaaaaa The use case of loading a dataframe that contains an array is not supported by BigQuery and raises an error (see the Parquet limitation note for load_table_from_dataframe). Instead of using load_table_from_dataframe you can use load_table_from_json, for example:

from google.cloud import bigquery
import pandas

client = bigquery.Client()

schema = [bigquery.SchemaField("nested_repeated", "INTEGER", mode="REPEATED")]
job_config = bigquery.LoadJobConfig(schema=schema)

record = [1, 2, 3]  # example values; not defined in the original snippet
data = [{"nested_repeated": record}]

client.load_table_from_json(data, "table_id", job_config=job_config).result()
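If the data already lives in a dataframe, one way to apply this workaround (a sketch, assuming a list column named "nested_repeated" and a placeholder table id) is to convert the rows to JSON-serializable dicts first:

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

schema = [bigquery.SchemaField("nested_repeated", "INTEGER", mode="REPEATED")]
job_config = bigquery.LoadJobConfig(schema=schema)

# Hypothetical dataframe with a list-valued column.
df = pd.DataFrame({"nested_repeated": [[1, 2, 3], [4, 5]]})

# Each row becomes a dict, so list columns survive as JSON arrays.
rows = df.to_dict(orient="records")
client.load_table_from_json(rows, "project.dataset.table", job_config=job_config).result()
```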
So just to confirm my read here, am I correct in saying: there is no way to use a dataframe to load ARRAY (REPEATED) columns into BigQuery?
Pyarrow 2.0 was released with improvements to Parquet serialization. We should revisit to see if this issue can be resolved with pyarrow 2.0.
@tswast I have tried the two following examples and they work fine.

Example 1:

import pandas as pd
import numpy as np
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(autodetect=True)

table = bigquery.Table('table')
table = client.create_table(table, exists_ok=True)

df = pd.DataFrame({'A': [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])]})
job = client.load_table_from_dataframe(df, "table_id", job_config=job_config).result()

Example 2:

table_1 = bigquery.Table('table_1')
table_1 = client.create_table(table_1, exists_ok=True)
job_config = bigquery.LoadJobConfig(autodetect=True)

df = pd.DataFrame([{'uuid': 'test1', 'class': [['cls1', 'class2'], ['cls1', 'class3']]}])
job = client.load_table_from_dataframe(df, "table_id_1", job_config=job_config).result()
Let's add some system test cases and/or code samples for this (which are skipped unless we have pyarrow >= 2.0) before we close this issue out. I imagine it'd be useful to have samples which show more complex dataframes in the following docs:
Based on the schema in #365, this feature is not yet supported. It still serializes to a strange format with an "item" column.
We should look into how to integrate --parquet_enable_list_inference from the BQ CLI.
We can already use that flag:

parquet_options = bigquery.format_options.ParquetOptions()
parquet_options.enable_list_inference = True

job_config = bigquery.LoadJobConfig()
job_config.parquet_options = parquet_options

job = client.load_table_from_dataframe(
    df, TABLE_NAME, job_config=job_config
)
^ This is true, but it only partly resolves this issue, or at least the specific complaint: "It still serializes to a strange format with an 'item' column."
I think the problem probably has to do with the implied nullability of the Parquet schema (we would need to infer and remove nullness, I think).
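For illustration, a small sketch (not from the thread) that writes the same dataframe both ways and prints the resulting Parquet schemas. It shows the inner "item" field that ends up in the BigQuery schema by default versus the spec-compliant "element" layout produced by use_compliant_nested_type=True; it assumes a pyarrow version that supports that option.

```python
import io

import pandas as pd
import pyarrow
import pyarrow.parquet

df = pd.DataFrame({"int64_array": [[1, 2, 3], [4, 5, 6]]})
arrow_table = pyarrow.Table.from_pandas(df)

for compliant in (False, True):
    buf = io.BytesIO()
    pyarrow.parquet.write_table(arrow_table, buf, use_compliant_nested_type=compliant)
    buf.seek(0)
    # The Parquet schema shows the nested list layout: an inner "item" field
    # by default, "element" when the compliant nested type is requested.
    print(compliant)
    print(pyarrow.parquet.ParquetFile(buf).schema)
```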
This works as a workaround:

df = pd.DataFrame([[[1, 2, 3]], [[4, 5, 6]]], columns=('int64_array',))

writer = pyarrow.BufferOutputStream()
pyarrow.parquet.write_table(
    pyarrow.Table.from_pandas(df),
    writer,
    use_compliant_nested_type=True,
)
reader = pyarrow.BufferReader(writer.getvalue())

client = bigquery.Client()
parquet_options = bigquery.format_options.ParquetOptions()
parquet_options.enable_list_inference = True

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.PARQUET
job_config.parquet_options = parquet_options

job = client.load_table_from_file(
    reader, TABLE_NAME, job_config=job_config
)

I promise it does for realz this time.
@judahrand so the fix is using use_compliant_nested_type=True or enable_list_inference = True?
Both are needed.
This is great, thanks @judahrand! Since, as you point out, the relevant serialization code lives in python-bigquery/google/cloud/bigquery/_pandas_helpers.py (lines 521 to 549 at ba02f24), we could expose this option there.
Yeah, exposing it somewhere would be best - maybe don't change the default, just document it 😛 Only took 2+ years for an answer 🤣
I'm happy to look into this if no one else fancies it.
…#980) Fixes #19 🦕
The support documentation for the Python BigQuery API indicates that arrays are possible; however, when passing a pandas dataframe to BigQuery there is a pyarrow struct issue.
The only way around it, it seems, is to drop those columns and then use JSON normalize for a separate table.
This is the error received: NotImplementedError: struct
The reason I wanted to use this API is that it indicates nested array support, which is perfect for our data lake in BQ, but I assume this doesn't work?