
gbq.py: silently downcasting INTEGER columns to FLOAT is problematic #14020

Closed
@aschmolck

Description


In many cases INTEGER columns in BigQuery are used to store IDs. Such IDs are then likely to be silently corrupted on conversion to a DataFrame, because any value above 2**53 cannot be represented exactly as a float64. There seems to be no good way around this, even when it is clear a priori that the query will only return non-null INTEGER columns.
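
For concreteness, here is a minimal illustration of the kind of silent corruption involved (this is plain pandas/numpy, not the gbq code itself):

```python
import pandas as pd

# A 64-bit ID that is not exactly representable as a float64
# (float64 has only a 53-bit mantissa, so integers above 2**53 lose precision).
big_id = 2**53 + 1

s = pd.Series([big_id], dtype='int64')
print(s.iloc[0])                    # 9007199254740993
print(s.astype('float64').iloc[0])  # 9007199254740992.0 -- silently off by one
```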

I understand the motivation (the ability to represent NULL values easily and efficiently), but this seems both undesirable (at the very least, pandas should warn about integers that are not exactly convertible to floats) and inconsistent with how SQL queries behave in pandas.

I tried to see whether the conversion could take the 'mode' (NULLABLE or not) of each field into account, but that part of the schema information turns out to be of very limited value in BigQuery, since even trivial projections turn REQUIRED fields into NULLABLE.
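
For reference, such a mode-based check would look roughly like the sketch below (`nullable_int_fields` is just an illustrative helper; the assumed input is the `schema` dict of the BigQuery query response, i.e. `{'fields': [{'name': ..., 'type': ..., 'mode': ...}, ...]}`), but as noted it is defeated by the API reporting projected columns as NULLABLE:

```python
def nullable_int_fields(schema):
    """Return the names of INTEGER fields that the schema marks as NULLABLE.

    Of limited use in practice: even trivial projections report REQUIRED
    source fields as NULLABLE.
    """
    return [f['name'] for f in schema.get('fields', [])
            if f.get('type') == 'INTEGER'
            and f.get('mode', 'NULLABLE') == 'NULLABLE']
```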

I'm not sure what the best solution is, but I think gbq and sql should behave identically with respect to NULLs, and that at the very least there should be a warning on information-losing integer->float conversions.
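
Detecting a lossy conversion is cheap; a minimal sketch (the function name and signature are mine, not part of gbq) could be:

```python
import warnings

def warn_if_lossy(int_values, column):
    """Warn when an int -> float64 cast would not round-trip exactly.

    `int_values` is an iterable of the column's non-null integer values,
    `column` is only used for the message (both are illustrative names).
    """
    if any(int(float(v)) != int(v) for v in int_values):
        warnings.warn(
            "column %r: INTEGER values exceed float64 precision; "
            "the int -> float64 conversion is lossy" % column,
            RuntimeWarning,
        )
```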

So my preference would be to check the returned JSON query data for NULLs and use np.dtype('int64') if there are none in an INTEGER column, and 'object' otherwise. Where that's deemed too inefficient, I think it's better to add a read_csv-style dtype argument so that users can explicitly request a float representation for a particular column (or just cast directly in the pseudo-SQL query itself). I'd rather have my code be slower (and possibly see a runtime warning to that effect) than unwittingly corrupt data without any warning.
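
A minimal sketch of that preferred behaviour, assuming the raw JSON cell values for one INTEGER column are available as a list with missing cells as None (the helper name and signature are mine, not gbq's):

```python
import pandas as pd

def build_integer_column(raw_values):
    """Build a pandas column for BigQuery INTEGER data without lossy float casts.

    - no NULLs  -> exact int64 column
    - any NULLs -> object column of Python ints and None
    """
    if any(v is None for v in raw_values):
        return pd.Series([None if v is None else int(v) for v in raw_values],
                         dtype='object')
    return pd.Series([int(v) for v in raw_values], dtype='int64')
```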

Labels: Dtype Conversions (Unexpected or buggy dtype conversions)