standard bigquery v2, support nested schema in query result #1593

c0b · 2016-09-14T17:44:00Z

Please support nested schema field names in complex queries like SELECT ARRAY[STRUCT(1 AS a, 2 AS b, 3 AS c)] AS d

const BigQuery = require('@google-cloud/bigquery');
> BigQuery().query({ query: 'SELECT 1 AS cnt', useLegacySql: false }, function (err, data) { if (err) return console.error(err); console.dir(data, { depth: null, colors: true }); });
> [ { cnt: 1 } ]

This is right, the field name cnt is correctly used

> BigQuery().query({ query: 'SELECT ARRAY[STRUCT(1 AS a, 2 AS b, 3 AS c)] AS d', useLegacySql: false }, function (err, data) { if (err) return console.error(err); console.dir(data, { depth: null, colors: true }); });
> [ { d: [ { v: { f: [ { v: '1' }, { v: '2' }, { v: '3' } ] } } ] } ]

only the outer level field name d is in use, not others a, b, c,
all get default f and v, and all their values type, the 1,2,3 here should be integers

I have even more complex queries than this, all are like this only the outer one level schema field names / value types are got populated

A raw query over HTTP, the http tool is from httpie, just an advanced version of curl; after I get access token with same httpie tool from token allocation, I can do queries like this on command line, here the server response included correct schema with nested fields, so technically it should be possible to populate all inner level field names and value types correct

➸ time http --verbose --pretty=colors POST https://www.googleapis.com/bigquery/v2/projects/XXXXXXX/queries 'Authorization: Bearer XXXXXXXXXXXX' useLegacySql=false useQueryCache=false query='SELECT ARRAY[STRUCT(1 AS a, 2 AS b, 3 AS c)] AS d'
POST /bigquery/v2/projects/XXXXXXXXXXXXXX/queries HTTP/1.1
Content-Length: 113
Accept-Encoding: gzip, deflate
Host: www.googleapis.com
Accept: application/json, */*
User-Agent: HTTPie/0.9.6
Connection: keep-alive
Content-Type: application/json
Authorization: Bearer XXXXXXXXXXXXXXXXXXXXXX

{"useLegacySql": "false", "useQueryCache": "false", "query": "SELECT ARRAY[STRUCT(1 AS a, 2 AS b, 3 AS c)] AS d"}

HTTP/1.1 200 OK
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Wed, 14 Sep 2016 17:37:03 GMT
ETag: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
Vary: Origin
Vary: X-Origin
Content-Type: application/json; charset=UTF-8
Content-Encoding: gzip
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Server: GSE
Alt-Svc: quic=":443"; ma=2592000; v="36,35,34,33,32"
Transfer-Encoding: chunked

{
 "kind": "bigquery#queryResponse",
 "schema": {
  "fields": [
   {
    "name": "d",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
     {
      "name": "a",
      "type": "INTEGER",
      "mode": "NULLABLE"
     },
     {
      "name": "b",
      "type": "INTEGER",
      "mode": "NULLABLE"
     },
     {
      "name": "c",
      "type": "INTEGER",
      "mode": "NULLABLE"
     }
    ]
   }
  ]
 },
 "jobReference": {
  "projectId": "XXXXXXXXXXXXX",
  "jobId": "job_XXXXXXXXXXXXXXXXXXXXXXXXXXX"
 },
 "totalRows": "1",
 "rows": [
  {
   "f": [
    {
     "v": [
      {
       "v": {
        "f": [
         {
          "v": "1"
         },
         {
          "v": "2"
         },
         {
          "v": "3"
         }
        ]
       }
      }
     ]
    }
   ]
  }
 ],
 "totalBytesProcessed": "0",
 "jobComplete": true,
 "cacheHit": false
}

real    0m3.040s
user    0m0.364s
sys 0m0.020s

Environment details

OS: Linux
Node.js version: v6

I have search other issues here #1564 #1135 about nested schema, but for different purpose, I believe this ticket is different.

The text was updated successfully, but these errors were encountered:

stephenplusplus · 2016-09-14T17:56:17Z

Thanks for the report! We'll definitely dig into this. In the meantime, you should be able to access the same data returned from httpie with the 3rd param to the query callback:

BigQuery().query('...', {...}, function(err, data, apiResponse) {
  // apiResponse === http response body
})

c0b · 2016-09-14T18:03:48Z

Another thing confused me is there seems multiple bigquery nodejs binding, above code uses const BigQuery = require('@google-cloud/bigquery'); var bigquery = BigQuery(); bigquery.query(...), there are other sample code uses require('google-cloud').bigquery('v2') and get a readable stream API that needs to listen on 'data' event and process each one, there is even a 3rd one is googleapis npm package, that uses var google = require('googleapis'); var bigquery = google.bigquery('v2'); and bigquery.jobs.query(...) all these I have tried so far have a same problem of inner fields not populated;

But could you clarify which library is officially supported by Google? or which one is latest? and could you mark others as deprecated?

> const gcloud = require('google-cloud');
> var bigquery = gcloud.bigquery('v2');
undefined
> bigquery.query({ query: 'SELECT STRUCT(1 AS a, 2 AS b, 3 AS c) AS d', useLegacySql: false }).on('data', rows => { console.dir(rows, { depth: null, colors: true }); } );
DestroyableTransform {
  _readableState: 
   ReadableState {
     objectMode: true,
     highWaterMark: 16,
     buffer: [],
   ...
> { d: { f: [ { v: '1' }, { v: '2' }, { v: '3' } ] } }

c0b · 2016-09-14T18:08:32Z

the google/google-api-nodejs-client @google has bigquery support as well, but different APIs, which one should I use?

stephenplusplus · 2016-09-14T18:09:12Z

https://googlecloudplatform.github.io/google-cloud-node/#/:

What is the relationship between google-cloud and the Google APIs Node.js Client?

The Google APIs Node.js Client is a client library for using the broad set of Google APIs. google-cloud is built specifically for the Google Cloud Platform and is the recommended way to integrate Google Cloud APIs into your Node.js applications. If your application requires both Google Cloud Platform and other Google APIs, the 2 libraries may be used by your application.

So, this is the handmade API. googleapis is auto-generated. We have two different ways you can use BigQuery: with the all-in-one module, google-cloud. This ships with every API we support in this library. If you don't want the extra APIs you're not using, you can just install the BigQuery API: @google-cloud/bigquery.

require('google-cloud').bigquery('v2')

We don't allow specifying a version to use in this library, we always use the latest. The v2 parameter is being ignored.

c0b · 2016-09-14T18:16:07Z

// apiResponse === http response body

that would be a workaround before this issue is resolved, that means to parse the schema by myself; but I prefer wait this issue to be resolved

stephenplusplus · 2016-09-14T18:27:08Z

Would you mind getting the JSON output from a complicated query and running it through here:

BigQuery().query('...', {}, function(err, data, apiResponse) {
  console.log(JSON.stringify(test(apiResponse.schema, apiResponse.rows), null, 2))
});

function test(schema, rows) {
  return rows.map(mergeSchema).map(flattenRows);

  function mergeSchema(row) {
    return row.f.map(function(field, index) {
      var schemaField = schema.fields[index];
      var value = field.v;

      switch (schemaField.type) {
        case 'BOOLEAN': {
          value = value === 'true';
          break;
        }
        case 'FLOAT': {
          if (!is.nil(value)) {
            value = parseFloat(value);
          }
          break;
        }
        case 'INTEGER': {
          if (!is.nil(value)) {
            value = parseInt(value, 10);
          }
          break;
        }
        case 'RECORD': {
          value = Table.mergeSchemaWithRows_(schemaField, [value[0].v]);
          break;
        }
        case 'TIMESTAMP': {
          value = new Date(value * 1000);
          break;
        }
      }

      var fieldObject = {};
      fieldObject[schemaField.name] = value;
      return fieldObject;
    });
  }

  function flattenRows(rows) {
    return rows.reduce(function(acc, row) {
      var key = Object.keys(row)[0];
      acc[key] = row[key];
      return acc;
    }, {});
  }
}

If that works as expected, I'll send a PR.

stephenplusplus · 2016-09-14T18:30:57Z

I'm pretty sure it's going to need tweaking, so if you could provide the raw JSON response body from what you think is the most complicated query, I'll be in a better place to test. Thanks!

c0b · 2016-09-16T02:47:07Z

I am testing python api as well, that is able to parse some inner level structure, but the value type is still not right:

When I put a wrong SQL, it is able to infer the inner struct is STRUCT<STRING, INT64, STRING>

$ glcoud-sdk/.../path/to/bq.py --format=prettyjson query --nouse_legacy_sql 'SELECT ARRAY[STRUCT("zx83" AS id, 1 AS a, 2 AS b), ("f8f7", 4, "7a")] d'
Error in query string: Error processing job 'XXXXXXXX': Array elements of types
{STRUCT<STRING, INT64, STRING>, STRUCT<id STRING, a INT64, b INT64>} do not have a common supertype at [1:8]

When I put a correct SQL, it gives me this output:

bq.py --format=prettyjson query --nouse_legacy_sql 'SELECT ARRAY[STRUCT("zx83" AS id, 1 AS a, 2 AS b), ("f8f7", 4, 7)] d'
[
  {
    "d": [
      {
        "a": "1", 
        "b": "2", 
        "id": "zx83"
      }, 
      {
        "a": "4", 
        "b": "7", 
        "id": "f8f7"
      }
    ]
  }
]

the fields a, b is already known as INT, should be a number in JSON and please keep the order, I know this is a bad problem of Python's default JSON parser, I guess it internally called json.loads to parse an object to python's dict and output by json.dumps that dict is completely unordered;

Javascript's JSON.parse or JSON.stringify doesn't have the problem, that comes in right order; while I am expecting it returns the right value type

// expected output
[
  {
    "d": [
      {
        "id": "zx83"
        "a": 1,
        "b": 2,
      },
      {
        "id": "f8f7"
        "a": 4,
        "b": 7,
      }
    ]
  }
]

stephenplusplus · 2016-09-16T14:03:29Z

Interesting. If something unexpected is happening with that library, I'll have to ask someone where we can report the bug. Off-hand, I'm not sure.

As far as ordering of the results, we will keep it in the order the HTTP JSON API returns it to us. The only extra step we do is combining the data and transforming it back to the native type of the result.

To make sure my attempted solution above can handle even the most complex queries, would you mind running one that you think might matches, then send over the JSON response body from the API? I appreciate your help debugging so far, sorry for asking for more!

c0b · 2016-09-23T03:24:11Z

I've just got some cycles to test on this, are you sure the 3rd argument apiResponse exist?

I'm following examples in this doc: which says nextQuery is 3rd, and apiResponse is the 4th,

https://googlecloudplatform.github.io/google-cloud-node/#/docs/google-cloud/0.40.0/bigquery?method=query

const bigquery = BigQuery();

bigquery.query({
  query: 'SELECT ARRAY[STRUCT("fa23" AS id, 1 AS a, 2 AS b, 3.4 AS c), ("bd23", 2, 3, 3.14)] AS d',
  useLegacySql: false
}, (err, rows, nextQuery, apiResponse) => {
  if (err) {
    return console.error(err);
  }

  console.log(arguments.length);
  console.dir(rows, { depth: null, colors: true });
  console.dir({ err, rows, nextQuery, apiResponse },
              { depth: null, colors: true });
});

but the code get this output: I don't see either nextQuery or apiResponse; does it require any option to be set?

5
[ { d: 
     [ { v: { f: [ { v: 'fa23' }, { v: '1' }, { v: '2' }, { v: '3.4' } ] } },
       { v: { f: [ { v: 'bd23' }, { v: '2' }, { v: '3' }, { v: '3.14' } ] } } ] } ]
{ err: null,
  rows: 
   [ { d: 
        [ { v: { f: [ { v: 'fa23' }, { v: '1' }, { v: '2' }, { v: '3.4' } ] } },
          { v: { f: [ { v: 'bd23' }, { v: '2' }, { v: '3' }, { v: '3.14' } ] } } ] } ],
  nextQuery: undefined,
  apiResponse: undefined }

I've tried your way to use query string as 1st, options as 2nd parameter, but it seems ignoring useLegacySql, always reporting invalid SQL (it parsed as legacy SQL)

BigQuery().query('SELECT STRUCT(...)', { useLegacySql: false }, function(err, data, apiResponse) {
  console.log(JSON.stringify(test(apiResponse.schema, apiResponse.rows), null, 2))
});

does the library support a NODE_DEBUG=... variable to dump api server interaction details?

stephenplusplus · 2016-09-23T14:24:55Z

Sorry, my earlier example was wrong in regards to apiResponse existing by default. The JSON APIs return responses to us in a paginated format. By default, our library will make as many API requests as necessary to get all of the results before executing your callback. In order to receive an apiResponse argument, you must shut off this feature with autoPaginate: false.

Here's another version that will run all of the queries and return the results to your callback once, but also collect all of the apiResponses. Please give this a shot:

BigQuery().query('SELECT STRUCT(...)', {
  useLegacySql: false,
  autoPaginate: false
}, function(err, rows, nextQuery, apiResponse) {
  if (err) throw err
  console.log(JSON.stringify(test(apiResponse.schema, apiResponse.rows), null, 2))
})

function test(schema, rows) {
  return rows.map(mergeSchema).map(flattenRows);

  function mergeSchema(row) {
    return row.f.map(function(field, index) {
      var schemaField = schema.fields[index];
      var value = field.v;

      switch (schemaField.type) {
        case 'BOOLEAN': {
          value = value === 'true';
          break;
        }
        case 'FLOAT': {
          if (!is.nil(value)) {
            value = parseFloat(value);
          }
          break;
        }
        case 'INTEGER': {
          if (!is.nil(value)) {
            value = parseInt(value, 10);
          }
          break;
        }
        case 'RECORD': {
          value = Table.mergeSchemaWithRows_(schemaField, [value[0].v]);
          break;
        }
        case 'TIMESTAMP': {
          value = new Date(value * 1000);
          break;
        }
      }

      var fieldObject = {};
      fieldObject[schemaField.name] = value;
      return fieldObject;
    });
  }

  function flattenRows(rows) {
    return rows.reduce(function(acc, row) {
      var key = Object.keys(row)[0];
      acc[key] = row[key];
      return acc;
    }, {});
  }
}

c0b · 2016-09-24T00:37:09Z

this code works for a simple struct, but I believe a recursive call is needed somewhere, for deeply nested structures, like

  apiResponse: 
   { kind: 'bigquery#queryResponse',
     schema: 
      { fields: 
         [ { name: 'd',
             type: 'RECORD',
             mode: 'REPEATED',
             fields: 
              [ { name: 'id', type: 'STRING', mode: 'NULLABLE' },
                { name: 'a', type: 'INTEGER', mode: 'NULLABLE' },
                { name: 'b', type: 'INTEGER', mode: 'NULLABLE' },
                { name: 'c', type: 'FLOAT', mode: 'NULLABLE' },
                { name: 'f',
                  type: 'RECORD',
                  mode: 'NULLABLE',
                  fields: [ { name: 'a', type: 'INTEGER', mode: 'NULLABLE' } ] } ] } ] },
     rows: 
      [ { f: 
           [ { v: 
                [ { v: 
                     { f: 
                        [ { v: 'fa23' },
                          { v: '1' },
                          { v: '2' },
                          { v: '3.4' },
                          { v: { f: [ { v: '1' } ] } } ] } },
                  { v: 
                     { f: 
                        [ { v: 'bd23' },
                          { v: '2' },
                          { v: '3' },
                          { v: '3.14' },
                          { v: { f: [ { v: '3' } ] } } ] } } ] } ] } ],
     totalBytesProcessed: '0',
     jobComplete: true,
     cacheHit: false } }

[ { d: [ { id: 'fa23', a: 1, b: 2, c: 3.4, f: { f: [ { v: '1' } ] } } ] } ]          // this is currently printed, it seems only first element is shown for an array

This is from python bq tool:

[
  {
    "d": [
      {
        "a": "1", 
        "b": "2", 
        "c": "3.4", 
        "f": {
          "a": "1"
        }, 
        "id": "fa23"
      }, 
      {
        "a": "2", 
        "b": "3", 
        "c": "3.14", 
        "f": {
          "a": "3"
        }, 
        "id": "bd23"
      }
    ]
  }
]

Expected is:

[
  {
    "d": [
      {
        "id": "fa23",
        "a": 1,
        "b": 2,
        "c": 3.4,
        "f": {
          "a": 1
        }
      },
      {
        "id": "bd23",
        "a": 2,
        "b": 3,
        "c": 3.14,
        "f": {
          "a": 3
        },
      }
    ]
  }
]

stephenplusplus · 2016-09-24T00:49:55Z

Thanks for the snippet, I'll get to work on a recursive solution.

c0b · 2016-09-24T01:04:31Z

the query is bq --format=prettyjson query --use_legacy_sql=false 'SELECT ARRAY[STRUCT("fa23" AS id, 1 AS a, 2 AS b, 3.4 AS c, STRUCT(1 AS a) AS f), ("bd23", 2, 3, 3.14, STRUCT(3))] AS d'

stephenplusplus · 2016-09-24T02:10:51Z

Okay, please try:

console.log(JSON.stringify(test(apiResponse.schema, apiResponse.rows), null, 2))

function test(schema, rows) {
  return (Array.isArray(rows) ? rows : [rows]).map(mergeSchema).map(flattenRows);

  function mergeSchema(row) {
    return row.f.map(function(field, index) {
      var schemaField = schema.fields[index];
      var value = field.v;

      switch (schemaField.type) {
        case 'BOOLEAN': {
          value = value === 'true';
          break;
        }
        case 'FLOAT': {
          value = parseFloat(value);
          break;
        }
        case 'INTEGER': {
          value = parseInt(value, 10);
          break;
        }
        case 'RECORD': {
          if (schemaField.mode === 'REPEATED') {
            value = value.map(function(val) {
              return test(schemaField, val.v).pop();
            });
          } else {
            value = test(schemaField, value).pop();
          }

          break;
        }
        case 'TIMESTAMP': {
          value = new Date(value * 1000);
          break;
        }
      }

      var fieldObject = {};
      fieldObject[schemaField.name] = value;
      return fieldObject;
    });
  }

  function flattenRows(rows) {
    return rows.reduce(function(acc, row) {
      var key = Object.keys(row)[0];
      acc[key] = row[key];
      return acc;
    }, {});
  }
}

Feel free to run other responses against it as well.

c0b · 2016-09-28T23:45:47Z

I'm re-using this code on a table.getRows; if a table is defined with nested fields, this function returns all objects with nested f and v so this recursive decode is also needed,

https://googlecloudplatform.github.io/google-cloud-node/#/docs/bigquery/0.3.0/bigquery%2Ftable?method=getRows

found 2 problems:

table.getRows's apiResponse doesn't have schema, so I have to call a table.getMetadata first, can this be called by table.getRows automatically? or have a parameter to let it return schema in apiResponse?
fields are by default NULLABLE, including RECORD can also be a NULL; I made a little change for it to be working:

function test(schema, rows) {
  return (Array.isArray(rows) ? rows : [rows]).map(mergeSchema).map(flattenRows);

  function mergeSchema(row) {
    return row.f.map(function(field, index) {
      var schemaField = schema.fields[index];
      var value = field.v;

      var fieldObject = {};

      // skip any schema decoding if it's null
      if (value === null) {
        fieldObject[schemaField.name] = null;
        return fieldObject;
      }

      switch (schemaField.type) {
        case 'BOOLEAN': {
          value = value === 'true';
          break;
        }
        case 'FLOAT': {
          value = parseFloat(value);
          break;
        }
        case 'INTEGER': {
          value = parseInt(value, 10);
          break;
        }
        case 'RECORD': {
          if (schemaField.mode === 'REPEATED') {
            value = value.map(function(val) {
              return test(schemaField, val.v).pop();
            });
          } else {
            value = test(schemaField, value).pop();
          }

          break;
        }
        case 'TIMESTAMP': {
          value = new Date(value * 1000);
          break;
        }
      }

      fieldObject[schemaField.name] = value;
      return fieldObject;
    });
  }

  function flattenRows(rows) {
    return rows.reduce(function(acc, row) {
      var key = Object.keys(row)[0];
      acc[key] = row[key];
      return acc;
    }, {});
  }
}

stephenplusplus · 2016-09-29T15:35:42Z

Thanks! I'll put that into a PR and ping you for another look when it's ready.

stephenplusplus · 2016-09-29T17:10:29Z

@c0b - please take a look at #1648.

table.getRows's apiResponse doesn't have schema, so I have to call a table.getMetadata first, can this be called by table.getRows automatically? or have a parameter to let it return schema in apiResponse?

We actually do that automatically within getRows(), and the schema goes onto the table's instance property, table.metadata.schema. To test from your callback to getRows, you can try:

table.getRows(function(err, rows, apiResponse) {
  console.log(JSON.stringify(test(table.metadata.schema, apiResponse.rows), null, 2))
});

stephenplusplus added the api: bigquery Issues related to the BigQuery API. label Sep 14, 2016

c0b mentioned this issue Sep 17, 2016

bigquery: please support non-legacy-sql nested data type, and preserve order in json googleapis/google-cloud-python#2342

Closed

stephenplusplus mentioned this issue Sep 29, 2016

bigquery: convert nested objects to their native types #1648

Merged

callmehiphop closed this as completed in #1648 Oct 14, 2016

stephenplusplus mentioned this issue Nov 28, 2016

Support (non-legacy) SQL nested data types googleapis/google-cloud-python#2354

Closed

JustinBeckwith assigned yoshi-automation Feb 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

standard bigquery v2, support nested schema in query result #1593

standard bigquery v2, support nested schema in query result #1593

c0b commented Sep 14, 2016 •

edited by stephenplusplus

Loading

stephenplusplus commented Sep 14, 2016

c0b commented Sep 14, 2016

c0b commented Sep 14, 2016 •

edited

Loading

stephenplusplus commented Sep 14, 2016

c0b commented Sep 14, 2016 •

edited

Loading

stephenplusplus commented Sep 14, 2016

stephenplusplus commented Sep 14, 2016

c0b commented Sep 16, 2016 •

edited

Loading

stephenplusplus commented Sep 16, 2016

c0b commented Sep 23, 2016 •

edited

Loading

stephenplusplus commented Sep 23, 2016

c0b commented Sep 24, 2016

stephenplusplus commented Sep 24, 2016

c0b commented Sep 24, 2016

stephenplusplus commented Sep 24, 2016

c0b commented Sep 28, 2016

stephenplusplus commented Sep 29, 2016

stephenplusplus commented Sep 29, 2016

standard bigquery v2, support nested schema in query result #1593

standard bigquery v2, support nested schema in query result #1593

Comments

c0b commented Sep 14, 2016 • edited by stephenplusplus Loading

Environment details

stephenplusplus commented Sep 14, 2016

c0b commented Sep 14, 2016

c0b commented Sep 14, 2016 • edited Loading

stephenplusplus commented Sep 14, 2016

c0b commented Sep 14, 2016 • edited Loading

stephenplusplus commented Sep 14, 2016

stephenplusplus commented Sep 14, 2016

c0b commented Sep 16, 2016 • edited Loading

stephenplusplus commented Sep 16, 2016

c0b commented Sep 23, 2016 • edited Loading

stephenplusplus commented Sep 23, 2016

c0b commented Sep 24, 2016

stephenplusplus commented Sep 24, 2016

c0b commented Sep 24, 2016

stephenplusplus commented Sep 24, 2016

c0b commented Sep 28, 2016

stephenplusplus commented Sep 29, 2016

stephenplusplus commented Sep 29, 2016

c0b commented Sep 14, 2016 •

edited by stephenplusplus

Loading

c0b commented Sep 14, 2016 •

edited

Loading

c0b commented Sep 14, 2016 •

edited

Loading

c0b commented Sep 16, 2016 •

edited

Loading

c0b commented Sep 23, 2016 •

edited

Loading