Skip to content

STRUCT getitem operator creates unnecessary parens when chained with other types #1148

Open
@r1b

Description

@r1b

Summary

This is illustrated in test__struct.py. Here's a sample failing test that captures how I would expect this to behave:

import sqlalchemy

def test_struct_parens_demo(faux_conn):
    from sqlalchemy_bigquery import ARRAY, STRUCT
    table = sqlalchemy.Table(
        "t",
        sqlalchemy.MetaData(),
        sqlalchemy.Column("person", STRUCT(children=ARRAY(STRUCT(name=sqlalchemy.String)))),
    )
    expr = table.c.person.children[0].name
    expected_sql = f"SELECT `t`.`person`.children[OFFSET(%(param_1:INT64)s)].name AS `anon_1` \nFROM `t`"
    actual_sql = str(sqlalchemy.select(expr).compile(faux_conn.engine))
    assert actual_sql == expected_sql
AssertionError: assert 'SELECT ((`t`...1` \nFROM `t`' == 'SELECT `t`.`...1` \nFROM `t`'
  
  - SELECT `t`.`person`.children[OFFSET(%(param_1:INT64)s)].name AS `anon_1` 
  + SELECT ((`t`.`person`.children)[OFFSET(%(param_1:INT64)s)]).name AS `anon_1` 
  ?        ++                     +                           +
    FROM `t`

I believe this is happening because the STRUCT type is taking on the lowest possible operator precedence in SQLAlchemy, when it actually has the highest operator precedence in BigQuery.

This doesn't create any functional issue, but it does make it hard to read the rendered SQL. For example, I work on a system where we run snapshot tests on the SQL generated by the bigquery dialect. We also often consult the generated SQL in the BigQuery console when debugging.


Investigation

In _struct.py we make SQLAlchemy aware of struct_getitem_op by patching the default comparator to alias json_getitem_op:

sqlalchemy.sql.default_comparator.operator_lookup[
    struct_getitem_op.__name__
] = sqlalchemy.sql.default_comparator.operator_lookup["json_getitem_op"]

But it seems this is not the only mapping for operators. We also have _PRECEDENCE, which determines when parens are necessary.

If the target operator does not exist in _PRECEDENCE, it takes on the lowest possible precedence.

So this would explain why we get unnecessary parens. Note that I'm calling these parens "unnecessary" because I see that all indexed types in BigQuery have the same precedence (the highest).

Proposal

I guess we could also patch _PRECEDENCE, but I don't think that its necessary to patch SQLAlchemy internals to accomplish this. Here's an alternative: route both ARRAY and STRUCT indexing operations to getitem.

I think the intention for the builtin getitem operator is that it can handle any indexing operation, not necessarily just for the builtin ARRAY type. It seems possible to tweak visit_getitem_binary to render based on the type of the LHS of the indexing operation:

     def visit_getitem_binary(self, binary, operator_, **kw):
         left = self.process(binary.left, **kw)
         if binary.left.type._type_affinity is _struct.STRUCT:
             return f"{left}.{binary.right.value}"
         right = self.process(binary.right, **kw)
         return f"{left}[OFFSET({right})]"

Routing indexing operations to getitem also ensures that they have the same precedence, so the parens go away.

I poked around and found a few examples of other dialects checking _type_affinity in visit hooks (just to be sure we're not exchanging one internals reference for another) - here's a sample from mysql.

Metadata

Metadata

Assignees

Labels

api: bigqueryIssues related to the googleapis/python-bigquery-sqlalchemy API.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions