Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: helper functions for RLS #19055

Merged
merged 9 commits into from
Mar 11, 2022
Merged

Conversation

betodealmeida
Copy link
Member

@betodealmeida betodealmeida commented Mar 8, 2022

SUMMARY

This PR introduces 2 helper functions for RLS:

  • has_table_query(statement: Statement) -> bool:, which analyzes a SQL statement parsed bysqlparse and returns True if it queries a table (either in a FROM or JOIN).
  • insert_rls(token_list: TokenList, table: str, rls: TokenList) -> TokenList:, which given a parsed statement (or token list), a table name, and a parsed RLS expression, will reformat the query so that the RLS is present in any queries or subqueries referencing the table.

For example, if we have the RLS expression id=42 on the table my_table we could do:

>>> statement = sqlparse.parse('SELECT COUNT(*) FROM my_table')[0]
>>> has_table_query(statement)
True
>>> rls = sqlparse.parse('id=42')[0]
>>> print(insert_rls(statement, 'my_table', rls)
SELECT COUNT(*) FROM my_table WHERE my_table.id=42

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A

TESTING INSTRUCTIONS

I added unit tests covering different cases:

  • Regular queries
  • Nested queries
  • Joins
  • Union
  • Fully qualified table names (eg, schema.table_name)
  • RLS already in the SQL

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@betodealmeida betodealmeida requested review from villebro, john-bodley and suddjian and removed request for villebro March 8, 2022 02:28
@codecov
Copy link

codecov bot commented Mar 8, 2022

Codecov Report

Merging #19055 (00598fa) into master (77063cc) will increase coverage by 0.05%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19055      +/-   ##
==========================================
+ Coverage   66.56%   66.61%   +0.05%     
==========================================
  Files        1641     1641              
  Lines       63495    63583      +88     
  Branches     6425     6425              
==========================================
+ Hits        42265    42358      +93     
+ Misses      19550    19545       -5     
  Partials     1680     1680              
Flag Coverage Δ
hive 52.53% <11.26%> (-0.07%) ⬇️
mysql 81.87% <100.00%> (+0.06%) ⬆️
postgres 81.92% <100.00%> (+0.06%) ⬆️
presto 52.38% <11.26%> (-0.07%) ⬇️
python 82.36% <100.00%> (+0.06%) ⬆️
sqlite 81.67% <100.00%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/sql_parse.py 98.95% <100.00%> (+0.34%) ⬆️
superset/viz.py 58.26% <0.00%> (ø)
superset/config.py 91.82% <0.00%> (ø)
superset/reports/commands/execute.py 91.51% <0.00%> (ø)
superset/databases/schemas.py 98.47% <0.00%> (+<0.01%) ⬆️
superset/utils/core.py 90.25% <0.00%> (+0.02%) ⬆️
superset/commands/exceptions.py 92.98% <0.00%> (+0.25%) ⬆️
superset/security/manager.py 94.72% <0.00%> (+0.31%) ⬆️
superset/common/query_context_processor.py 91.46% <0.00%> (+0.47%) ⬆️
superset/db_engine_specs/databricks.py 90.90% <0.00%> (+0.90%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 77063cc...00598fa. Read the comment docs.

superset/sql_parse.py Outdated Show resolved Hide resolved
@@ -458,3 +460,178 @@ def validate_filter_clause(clause: str) -> None:
)
if open_parens > 0:
raise QueryClauseValidationException("Unclosed parenthesis in filter clause")


def has_table_query(statement: Statement) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@betodealmeida there's also this example which has logic for identifying tables.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember the details, but I've had issues with that example code before — I think it failed to identify table names when they were considered keywords (even though the example calls it out).

"""
seen_source = False
tokens = statement.tokens[:]
while tokens:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You likely could just do for token in stmt.flatten(): and remove the logic from lines 483–485.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.flatten() is a bit different in that it returns the leaf nodes only, converting an identifier into 1+ Name tokens:

>>> list(sqlparse.parse('SELECT * FROM my_table')[0].flatten())
[<DML 'SELECT' at 0x10FF019A0>, <Whitespace ' ' at 0x10FF01D00>, <Wildcard '*' at 0x10FF01D60>, <Whitespace ' ' at 0x10FF01DC0>, <Keyword 'FROM' at 0x10FF01E20>, <Whitespace ' ' at 0x10FF01E80>, <Name 'my_tab...' at 0x10FF01EE0>]

Since I'm looking for identifiers after a FROM or JOIN I thought it was easier to implement a traversal logic that actually inspects the parents, not just the leaves.


if token.ttype == Keyword and token.value.lower() in ("from", "join"):
seen_source = True
elif seen_source and (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The challenge here is there's no strong connection to ensure that the consecutive (or near consecutive) tokens are those which are being identified here. I guess the question is how robust do we want this logic. The proposed solution may well we suffice.

The correct way of doing this is more of a tree traversal (as opposed to a flattened list) where one checks the next token (which could be a group) from the FROM or JOIN keyword and iterate from there.

My sense is that can likely be addressed later. We probably need to cleanup the sqlparse logic to junk it completely in favor of something else given that it seems like the package is somewhat on life support.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, in the insert_rls function I had to implement tree traversal to get it right. Let me give it a try rewriting this one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@john-bodley I reimplemented it following the same logic as insert_rls (recursive tree traversal instead of flattening).

Modify a RLS expression ensuring columns are fully qualified.
"""
tokens = rls.tokens[:]
while tokens:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You likely could use flatten here. It uses a generator so likely a copy should be made given you're mutating the tokens, i.e.,

for token in list(rls.flatten()):
    if imt(token, i=Identifier) and token.get_parent_name() is None:
        ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue, if we call .flatten() we would never get an Identifier.

superset/sql_parse.py Show resolved Hide resolved
superset/sql_parse.py Outdated Show resolved Hide resolved
superset/sql_parse.py Outdated Show resolved Hide resolved
i, Where([Token(Keyword, "WHERE"), Token(Whitespace, " "), rls]),
)

# Right pad with space, if needed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does sqlparse even tokenize whitespace?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's because it makes it easier to convert the parse tree back to a string. Not sure.

superset/sql_parse.py Outdated Show resolved Hide resolved
superset/sql_parse.py Outdated Show resolved Hide resolved
@betodealmeida
Copy link
Member Author

betodealmeida commented Mar 10, 2022

@suddjian I modified the logic to always include the RLS even if it's already present, since there are a few corner cases that are hard to identify. For example, if we have the RLS user_id=1 and this query:

SELECT * FROM table
WHERE TRUE OR user_id=1

Even though we already have the token Comparison(user_id=1) in the WHERE clause we still need to apply since in this case the comparison is a no-op. So we need to add it:

SELECT * FROM table
WHERE TRUE OR user_id=1 AND user_id=1

More importantly, because of the precedence of AND over OR, we need to wrap the original predicate in parenthesis:

SELECT * FROM table
WHERE (TRUE OR user_id=1) AND user_id=1

Without parenthesis the predicate evaluates to TRUE OR (user_id=1 AND user_id=1), which bypasses the RLS!

I implemented the logic to wrap the original predicate and added tests covering it.

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! A few comments with a potential false positive, but other than that looks really good 👍

superset/sql_parse.py Outdated Show resolved Hide resolved
superset/sql_parse.py Outdated Show resolved Hide resolved
tests/unit_tests/sql_parse_tests.py Outdated Show resolved Hide resolved
@betodealmeida
Copy link
Member Author

Addressed all of @villebro's comments.

Copy link
Member

@suddjian suddjian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome
https://media.giphy.com/media/Zvgb12U8GNjvq/giphy.gif

@betodealmeida betodealmeida merged commit 8234395 into apache:master Mar 11, 2022
villebro pushed a commit that referenced this pull request Apr 3, 2022
* feat: helper functions for RLS

* Add function to inject RLS

* Add UNION tests

* Add tests for schema

* Add more tests; cleanup

* has_table_query via tree traversal

* Wrap existing predicate in parenthesis

* Clean up logic

* Improve table matching

(cherry picked from commit 8234395)
@mistercrunch mistercrunch added 🍒 1.5.0 🍒 1.5.1 🍒 1.5.2 🍒 1.5.3 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.0.0 labels Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels lts-v1 size/L 🍒 1.5.0 🍒 1.5.1 🍒 1.5.2 🍒 1.5.3 🚢 2.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants