Skip to content

Autodetect for BQ happening automatically even with schema defined.  #847

Closed
@brummetj

Description

@brummetj

Issue

I'm currently using the python sdk to transform and load data into BQ. One of the fields I have is supposed to be a string but the data will come in as either a long float number (ex: 75257229615211513551680283362634662820), int number ( ex: 4 ), or string (ex: http://some-string-url ).

So the data representation looks like

 {
            "count": "6",
            "value": "75257229615211513551680283362634662820"
          },
          {
            "count": "4",
            "value": "5"
          },
          {
            "count": "4",
            "value": "http://some-string-url"
          }

Since BQ does the auto detection, these values can be recognized as either a FLOAT, INT, or STRING. I followed docs, and stack recommendations to assign a hard schema to the config which looks like

        table = bigquery.table.Table(table_ref)

        job_config = bigquery.LoadJobConfig()
        if table_name == self.incident_table_name:
            job_config.schema = self.incidents_schema

        if table_name == self.kpi_table_name:
            job_config.schema = self.kpi_schema
        
        job_config.autodetect = False  # should already be assigned to False in the SDK ( just for a hard assurance )

The schema is read in from a json file that looks like

  "fields": [
          {
            "mode": "NULLABLE",
            "name": "count",
            "type": "INTEGER"
          },
          {
            "mode": "NULLABLE",
            "name": "value",
            "type": "STRING"
          }
        ],

and parsed into the SchemaField() object..

    field_schema = bigquery.SchemaField(name=name,
       field_type=field_type,
       mode=mode,
       fields=subschema
   )

Just to be clear i do believe that this is apart of a subschema and assigned to the fields as it's an array of objects.

then i submit the data through the SDK via the load_table_from_json which looks like

            job = params.bq_client.load_table_from_json([incident],
                                                        job_config=job_config,
                                                        destination=table)
            result = job.result()

All fairly straightforward, and follows closely with what is documented and shared by others. BUT for whatever reason, the API request will fail and continue to try parse value as either a STRING, FLOAT or INT even with the schema defined.

So a couple questions here... is this a bug with BQ, or the SDK not accepting the schema? Am i doing something wrong here? Does the autodetect feature always do its thing even when turned off and a schema defined?

Any communication here would be greatly appreciated.

Cheers

info

sdk version: google-cloud-bigquery 2.20.0
os: MacOS
environment: Cloud Function
python version: 3.9

Metadata

Metadata

Assignees

No one assigned

    Labels

    api: bigqueryIssues related to the googleapis/python-bigquery API.externalThis issue is blocked on a bug with the actual product.priority: p2Moderately-important priority. Fix may not be included in next release.status: will not fixInvalid (untrue/unsound/erroneous), inconsistent with product, not on roadmap.type: bugError or flaw in code with unintended results or allowing sub-optimal usage patterns.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions