Description
Issue
I'm currently using the python sdk to transform and load data into BQ. One of the fields I have is supposed to be a string but the data will come in as either a long float number (ex: 75257229615211513551680283362634662820
), int number ( ex: 4
), or string (ex: http://some-string-url
).
So the data representation looks like
{
"count": "6",
"value": "75257229615211513551680283362634662820"
},
{
"count": "4",
"value": "5"
},
{
"count": "4",
"value": "http://some-string-url"
}
Since BQ does the auto detection, these values can be recognized as either a FLOAT, INT, or STRING. I followed docs, and stack recommendations to assign a hard schema to the config which looks like
table = bigquery.table.Table(table_ref)
job_config = bigquery.LoadJobConfig()
if table_name == self.incident_table_name:
job_config.schema = self.incidents_schema
if table_name == self.kpi_table_name:
job_config.schema = self.kpi_schema
job_config.autodetect = False # should already be assigned to False in the SDK ( just for a hard assurance )
The schema is read in from a json file that looks like
"fields": [
{
"mode": "NULLABLE",
"name": "count",
"type": "INTEGER"
},
{
"mode": "NULLABLE",
"name": "value",
"type": "STRING"
}
],
and parsed into the SchemaField() object..
field_schema = bigquery.SchemaField(name=name,
field_type=field_type,
mode=mode,
fields=subschema
)
Just to be clear i do believe that this is apart of a subschema
and assigned to the fields
as it's an array of objects.
then i submit the data through the SDK via the load_table_from_json
which looks like
job = params.bq_client.load_table_from_json([incident],
job_config=job_config,
destination=table)
result = job.result()
All fairly straightforward, and follows closely with what is documented and shared by others. BUT for whatever reason, the API request will fail and continue to try parse value
as either a STRING, FLOAT or INT even with the schema defined.
So a couple questions here... is this a bug with BQ, or the SDK not accepting the schema? Am i doing something wrong here? Does the autodetect feature always do its thing even when turned off and a schema defined?
Any communication here would be greatly appreciated.
Cheers
info
sdk version: google-cloud-bigquery 2.20.0
os: MacOS
environment: Cloud Function
python version: 3.9