-
Notifications
You must be signed in to change notification settings - Fork 13.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: helper functions for RLS #19055
Conversation
Codecov Report
@@ Coverage Diff @@
## master #19055 +/- ##
==========================================
+ Coverage 66.56% 66.61% +0.05%
==========================================
Files 1641 1641
Lines 63495 63583 +88
Branches 6425 6425
==========================================
+ Hits 42265 42358 +93
+ Misses 19550 19545 -5
Partials 1680 1680
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
superset/sql_parse.py
Outdated
@@ -458,3 +460,178 @@ def validate_filter_clause(clause: str) -> None: | |||
) | |||
if open_parens > 0: | |||
raise QueryClauseValidationException("Unclosed parenthesis in filter clause") | |||
|
|||
|
|||
def has_table_query(statement: Statement) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@betodealmeida there's also this example which has logic for identifying tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember the details, but I've had issues with that example code before — I think it failed to identify table names when they were considered keywords (even though the example calls it out).
superset/sql_parse.py
Outdated
""" | ||
seen_source = False | ||
tokens = statement.tokens[:] | ||
while tokens: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You likely could just do for token in stmt.flatten():
and remove the logic from lines 483–485.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.flatten()
is a bit different in that it returns the leaf nodes only, converting an identifier into 1+ Name
tokens:
>>> list(sqlparse.parse('SELECT * FROM my_table')[0].flatten())
[<DML 'SELECT' at 0x10FF019A0>, <Whitespace ' ' at 0x10FF01D00>, <Wildcard '*' at 0x10FF01D60>, <Whitespace ' ' at 0x10FF01DC0>, <Keyword 'FROM' at 0x10FF01E20>, <Whitespace ' ' at 0x10FF01E80>, <Name 'my_tab...' at 0x10FF01EE0>]
Since I'm looking for identifiers after a FROM
or JOIN
I thought it was easier to implement a traversal logic that actually inspects the parents, not just the leaves.
superset/sql_parse.py
Outdated
|
||
if token.ttype == Keyword and token.value.lower() in ("from", "join"): | ||
seen_source = True | ||
elif seen_source and ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The challenge here is there's no strong connection to ensure that the consecutive (or near consecutive) tokens are those which are being identified here. I guess the question is how robust do we want this logic. The proposed solution may well we suffice.
The correct way of doing this is more of a tree traversal (as opposed to a flattened list) where one checks the next token (which could be a group) from the FROM
or JOIN
keyword and iterate from there.
My sense is that can likely be addressed later. We probably need to cleanup the sqlparse
logic to junk it completely in favor of something else given that it seems like the package is somewhat on life support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, in the insert_rls
function I had to implement tree traversal to get it right. Let me give it a try rewriting this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@john-bodley I reimplemented it following the same logic as insert_rls
(recursive tree traversal instead of flattening).
Modify a RLS expression ensuring columns are fully qualified. | ||
""" | ||
tokens = rls.tokens[:] | ||
while tokens: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You likely could use flatten
here. It uses a generator so likely a copy should be made given you're mutating the tokens, i.e.,
for token in list(rls.flatten()):
if imt(token, i=Identifier) and token.get_parent_name() is None:
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue, if we call .flatten()
we would never get an Identifier
.
superset/sql_parse.py
Outdated
i, Where([Token(Keyword, "WHERE"), Token(Whitespace, " "), rls]), | ||
) | ||
|
||
# Right pad with space, if needed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does sqlparse even tokenize whitespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's because it makes it easier to convert the parse tree back to a string. Not sure.
@suddjian I modified the logic to always include the RLS even if it's already present, since there are a few corner cases that are hard to identify. For example, if we have the RLS SELECT * FROM table
WHERE TRUE OR user_id=1 Even though we already have the token SELECT * FROM table
WHERE TRUE OR user_id=1 AND user_id=1 More importantly, because of the precedence of SELECT * FROM table
WHERE (TRUE OR user_id=1) AND user_id=1 Without parenthesis the predicate evaluates to I implemented the logic to wrap the original predicate and added tests covering it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! A few comments with a potential false positive, but other than that looks really good 👍
b2148f9
to
00598fa
Compare
Addressed all of @villebro's comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* feat: helper functions for RLS * Add function to inject RLS * Add UNION tests * Add tests for schema * Add more tests; cleanup * has_table_query via tree traversal * Wrap existing predicate in parenthesis * Clean up logic * Improve table matching (cherry picked from commit 8234395)
SUMMARY
This PR introduces 2 helper functions for RLS:
has_table_query(statement: Statement) -> bool:
, which analyzes a SQL statement parsed bysqlparse
and returnsTrue
if it queries a table (either in aFROM
orJOIN
).insert_rls(token_list: TokenList, table: str, rls: TokenList) -> TokenList:
, which given a parsed statement (or token list), a table name, and a parsed RLS expression, will reformat the query so that the RLS is present in any queries or subqueries referencing the table.For example, if we have the RLS expression
id=42
on the tablemy_table
we could do:BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
N/A
TESTING INSTRUCTIONS
I added unit tests covering different cases:
schema.table_name
)ADDITIONAL INFORMATION