BigQuery Time Values May Be Incorrect #1125
@arterrey Can you elaborate on where you are "getting values out of BigQuery in seconds"? For example, how do you use this library, and how does the output differ from what you expect?
Also making it require a float (or int) instead of bailing out for NoneType, changing the use of the method in bigquery to not pass potentially null values and to multiply by 1000.0 (converting from millis to micros), and updating the micros -> datetime conversion in datastore to use the updated method.
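For reference, here is a minimal sketch of the behavior that change describes; the helper name and epoch constant are assumptions for illustration, not the library's actual code:

```python
import datetime

_EPOCH = datetime.datetime(1970, 1, 1)

def _datetime_from_millis(value):
    """Hypothetical helper mirroring the change described above:
    require a real number (no NoneType), treat it as milliseconds,
    and multiply by 1000.0 to get microseconds."""
    micros = 1000.0 * float(value)  # millis -> micros
    return _EPOCH + datetime.timedelta(microseconds=micros)
```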
The field value from the web interface is "2012-06-12 08:05:14 UTC". From the pdb session below you can see that the value returns a datetime in 1970 rather than 2012.
That's really great info! Thanks for showing the pdb session. It looks like value == 1339.488314009370. Which request did you make to retrieve this data? Can you provide a code snippet?
I think you mean value == 1339488314.00937. I'm just using:

```python
def fetch_table(data_set, table_name):
    table = data_set.table(table_name)
    assert table.exists()
    table.reload()
    next_page = None
    while True:
        rows, total_count, next_page = table.fetch_data(page_token=next_page)
        for row in rows:
            # Use the schema to convert each row tuple into a dict.
            row_dict = {}
            for i in range(len(table.schema)):
                field = table.schema[i]
                row_dict[field.name] = row[i]
            # PathableDict is defined elsewhere in my code, not in this library.
            yield PathableDict(row_dict)
        if next_page is None:
            break
```
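For context, a hypothetical call site for the generator above; the project, dataset, and table names are made up, and the gcloud-era client API (bigquery.Client, client.dataset()) is assumed:

```python
# Hypothetical usage of fetch_table(); names below are illustrative only.
from gcloud import bigquery

client = bigquery.Client(project='my-project')  # assumed project id
dataset = client.dataset('my_dataset')          # assumed dataset name

for record in fetch_table(dataset, 'my_table'):  # assumed table name
    print(record)
```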
Sorry, I overlooked the actual value:

```
(Pdb) value
u'1.33948831400937E9'
```

It seems that instead of datetime.timedelta(microseconds=1000.0 * float(value)) we should just be using datetime.timedelta(seconds=float(value)). Like you said, the value is seconds, not milliseconds. I couldn't find any evidence of "seconds" in the discovery doc. I'm off to bed for now but will try to throw in some test data (from our system tests) and inspect the output in the APIs Explorer.

Also, FWIW, re: the link in our docstring: really just

```python
rows, total_count, next_page = table.fetch_data(page_token=next_page)
```

was what I was looking for. Thanks!
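To make the off-by-1000 concrete, here is a small worked check using the value from the pdb session above; the expected results are taken from the web-UI value quoted earlier in the thread:

```python
import datetime

epoch = datetime.datetime(1970, 1, 1)
value = u'1.33948831400937E9'  # raw TIMESTAMP cell from the API

# Current behavior: treat the value as milliseconds (x1000.0 to micros).
wrong = epoch + datetime.timedelta(microseconds=1000.0 * float(value))
print(wrong)  # 1970-01-16 12:04:48.314009 -- the bogus 1970 datetime

# Proposed fix: the value is seconds since the epoch.
right = epoch + datetime.timedelta(seconds=float(value))
print(right)  # 2012-06-12 08:05:14.009370 -- matches the web UI
```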
I just created a table with the following schema:

```json
[
  {
    "name": "Timestamp",
    "type": "TIMESTAMP",
    "mode": "REQUIRED"
  },
  {
    "name": "Description",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "Quantity",
    "type": "INTEGER",
    "mode": "REQUIRED"
  },
  {
    "name": "Unit_Price",
    "type": "FLOAT",
    "mode": "REQUIRED"
  }
]
```

and uploaded the following data:

```
$ cat /tmp/with_timestamp.csv
2015-09-09T17:59:22Z,Testing,1,2.5
```

In the developer console, the data looks like:

[screenshot of the row in the developer console]
In the API explorer, tabledata.list returns:

```json
{
  "kind": "bigquery#tableDataList",
  "etag": "\"m-9bZ2XqlR7n8jGvqgVmLOTsm1s/xddlFZR3Ob1xPfHtKV-u3wV2TLU\"",
  "totalRows": "1",
  "rows": [
    {
      "f": [
        {
          "v": "1.441821562E9"
        },
        {
          "v": "Testing"
        },
        {
          "v": "1"
        },
        {
          "v": "2.5"
        }
      ]
    }
  ]
}
```

The Preparing Data for BigQuery docs don't specify how timestamps are returned, only the formats which can be used to load the data.
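As a quick sanity check (not library code), decoding that first cell as epoch seconds round-trips to the uploaded CSV value:

```python
import datetime

raw = "1.441821562E9"  # the "v" value for the TIMESTAMP column above

# float() handles the scientific notation directly.
ts = datetime.datetime(1970, 1, 1) + datetime.timedelta(seconds=float(raw))
print(ts.isoformat() + 'Z')  # 2015-09-09T17:59:22Z -- the value we loaded
```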
So what shall we do? Also, how does loading handle the timestamp string?
Given that the values are always returned as scientific-formatted floats of seconds (unlike the format they were loaded in), parsing them as seconds seems right. ISO-8601 strings are one of the supported formats for loading.
I mean, what is the codepath?
I see. So the CSV goes directly through GCS, and our code doesn't actually handle the timestamp string?
Yup.
On the _datetime_from_json function, @arterrey commented that the values coming out of BigQuery are in seconds. This contradicts the note in the code that the value is in milliseconds.