
Support pyarrow.large_* as column type in dataframe upload/download #1706

Open
@cvm-a

Description


Thanks for stopping by to let us know something could be better!

PLEASE READ: If you have a support contract with Google, please create an issue in the support console instead of filing on GitHub. This will ensure a timely response.

Is your feature request related to a problem? Please describe.

  1. (P1 because it's so simple) When uploading a dataframe, I get a "Pyarrow could not determine the type of columns" warning with pyarrow.large_string() columns. This should be a trivial addition to _ARROW_SCALAR_IDS_TO_BQ.
  2. (P2 because it takes longer to fix) When downloading results, can we support (maybe even default to) pyarrow.large_string (and the other pyarrow.large_* types) instead of pyarrow.string for QueryJob.to_arrow? pyarrow.string has a 2 GiB limit on the size of the data in a column (not just in a single element) that's guaranteed to work correctly. Bigger query results might not break immediately, because the data is usually chunked smaller, but many dataframe operations (like aggregations or even indexing) on such columns trigger an "ArrowInvalid: offset overflow" error. This is mainly caused by design decisions in Arrow ([C++][Python] Large strings cause ArrowInvalid: offset overflow while concatenating arrays apache/arrow#33049), but we can try to keep BQ users safe. The performance/memory hit has usually been small, and 2 GiB is very easy to cross.

Describe the solution you'd like

  1. Add pyarrow.large_* keys to _ARROW_SCALAR_IDS_TO_BQ.
  2. Add an option (or default) to return large_* types from QueryJob.to_arrow.

Describe alternatives you've considered
For 2, I have been converting string columns to large_string myself immediately after loading; this has not triggered issues yet, but the Arrow API does not seem to guarantee that it will continue to work.
Additional context

Metadata

Labels

api: bigquery (Issues related to the googleapis/python-bigquery API)
type: feature request ('Nice-to-have' improvement, new feature or different behavior or design)
