**Is your feature request related to a problem? Please describe.**
- (P1 because it's so simple) When uploading a dataframe with pyarrow.large_string() columns, I get a "Pyarrow could not determine the type of columns" warning. This should be a trivial addition to _ARROW_SCALAR_IDS_TO_BQ (see the first sketch after this list).
- (P2 because it takes longer to fix) When downloading results, can we support (maybe even default to using) pyarrow.large_string (and the other pyarrow.large_* types) instead of pyarrow.string for QueryJob.to_arrow? pyarrow.string has a 2GiB limit on the size of the data in the column (not just in a single element) that is guaranteed to work correctly. If query results are bigger, they might not break immediately because the data is usually chunked smaller, but many dataframe operations (like aggregations or even indexing) on these columns trigger an "ArrowInvalid: offset overflow" error (see the second sketch after this list). This is mainly caused by bad decisions in Arrow ([C++][Python] Large strings cause ArrowInvalid: offset overflow while concatenating arrays apache/arrow#33049), but we can try to keep BQ users safe. The performance/memory hit has usually been small, and 2GiB is very easy to cross.
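A minimal sketch of the upload warning (the table name is a placeholder and the call needs real credentials to run):

```python
import pandas as pd
import pyarrow as pa
from google.cloud import bigquery

client = bigquery.Client()

# A column backed by pyarrow.large_string() instead of pyarrow.string().
df = pd.DataFrame(
    {"name": pd.array(["alice", "bob"], dtype=pd.ArrowDtype(pa.large_string()))}
)

# large_string is missing from _ARROW_SCALAR_IDS_TO_BQ, so schema detection
# falls back and the "could not determine the type" warning is raised.
client.load_table_from_dataframe(df, "my-project.my_dataset.my_table").result()
```

And a rough sketch of how the 2GiB offset limit bites after download (this needs a few GiB of free memory to actually run; the sizes are just illustrative):

```python
import pyarrow as pa

# Each chunk holds ~512MiB of string data, which is fine on its own...
chunk = pa.array(["x" * 1024] * (512 * 1024))
chunked = pa.chunked_array([chunk] * 5)  # ~2.5GiB total across chunks

# ...but flattening the column past the 2**31-byte offset limit fails with
# "ArrowInvalid: offset overflow while concatenating arrays".
chunked.combine_chunks()
```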
**Describe the solution you'd like**
- add pyarrow.large_* keys to _ARROW_SCALAR_IDS_TO_BQ
- add an option (or even a default) to return large_* types from QueryJob.to_arrow (both changes are sketched below)
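A rough, purely illustrative sketch of both pieces. It assumes _ARROW_SCALAR_IDS_TO_BQ maps pyarrow type ids to BigQuery type names the way the existing string/bytes entries do, and the use_large_types keyword is hypothetical, not an existing parameter:

```python
import pyarrow

# Upload path: the 64-bit-offset scalar types map to the same BigQuery types
# as their 32-bit-offset counterparts.
_ARROW_SCALAR_IDS_TO_BQ = {
    # ... existing entries ...
    pyarrow.string().id: "STRING",
    pyarrow.binary().id: "BYTES",
    pyarrow.large_string().id: "STRING",  # proposed
    pyarrow.large_binary().id: "BYTES",   # proposed
}

# Download path: a hypothetical opt-in on QueryJob.to_arrow.
# arrow_table = query_job.to_arrow(use_large_types=True)
```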
**Describe alternatives you've considered**
For the second item, I have converted the string columns to large_string myself immediately after loading (sketched below), and it has not triggered issues yet, but the Arrow API does not seem to guarantee that this will continue to work.
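Roughly what that workaround looks like (a sketch; to_large_types is just a local helper, not library API):

```python
import pyarrow as pa

def to_large_types(table: pa.Table) -> pa.Table:
    """Widen string/binary columns to their 64-bit-offset (large_*) variants."""
    fields = []
    for field in table.schema:
        if pa.types.is_string(field.type):
            fields.append(pa.field(field.name, pa.large_string(), field.nullable))
        elif pa.types.is_binary(field.type):
            fields.append(pa.field(field.name, pa.large_binary(), field.nullable))
        else:
            fields.append(field)
    return table.cast(pa.schema(fields))

# arrow_table = to_large_types(query_job.to_arrow())
```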
**Additional context**