Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make schema name for the CTA queries and limit configurable #8867

Merged
merged 13 commits into from
Mar 3, 2020

Conversation

bkyryliuk
Copy link
Member

@bkyryliuk bkyryliuk commented Dec 19, 2019

Description

This PR implements an ability to configure the logic that generates the schema name for the create table as queries based on the database, user, used schema and executed SQL.
There are 2 new parameters in the config, both are off by default.

# Function accepts database object, user object, schema name and sql that will be run.
SQLLAB_CTA_SCHEMA_NAME_FUNC = (
    None
)  # type: Optional[Callable[["Database", "User", str, str], str]]
# Flag that controls if limit should be enforced on the CTA (create table as queries).
SQLLAB_CTA_NO_LIMIT = False

Use cases

  • automatically put sensitive data into the sensitive schema and that data is accessed
  • put user CTA results into the user personal schema, it can be extended to the team schema's as well

Other changes

Tests

[x] unit test + integration tests
[x] tested locally on mysql & postgres
[x] is used in production @ dropbox for over a month

@request-info
Copy link

request-info bot commented Dec 19, 2019

We would appreciate it if you could provide us with more info about this issue/pr! Please do not leave the title or description empty.

@request-info request-info bot added the need:more-info Requires more information from author label Dec 19, 2019
@bkyryliuk bkyryliuk force-pushed the bogdan/cta_userschema branch 17 times, most recently from fabcf75 to 99b5b6c Compare December 20, 2019 22:27
@@ -366,6 +368,11 @@ def execute_sql_statements(
payload = handle_query_error(msg, query, session, payload)
return payload

# Commit the connection so CTA queries will create the table.
# TODO(bk): consider if it's only needed for postgres.
if conn:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would conn be falsy?

@@ -222,7 +222,10 @@ def get_query_with_new_limit(self, new_limit: int) -> str:
limit_pos = pos
break
_, limit = statement.token_next(idx=limit_pos)
if limit.ttype == sqlparse.tokens.Literal.Number.Integer:
# Override the limit only when it exceeds the configured value.
if limit.ttype == sqlparse.tokens.Literal.Number.Integer and new_limit < int(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmmh, this here in theory changes what the function is expected to do ("returns the query with the specified limit"), so either we change the name/docstring to reflect that, or either we move the conditional logic towards where the function is called.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the docstring, yeah there is a mismatch. It's hard to move this condition outside of this function as it needs to get the existing limit value from the query and would require to parse query twice. I can't think about the usecase where we would want to override lower user limit with the higher configured value, e.g. I would expect to see 1 row when I query select * from bla limit 1 rather than 100.

# Flag that controls if limit should be inforced on the create table as queries.
SQLLAB_CTA_NO_LIMIT = False

# Function accepts username, schema name and sql that will be run e.g.:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not super clear here what this does or why you'd want to use this config element. I had to read a bit of code to understand it.

This allows you to define custom logic around the "CREATE TABLE AS" or CTAS feautre in SQL Lab that defines where the target schema should be for a given user.

Then add a proper example with type annotation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want the database object as a param too as the policy may differ across databases. Also could be good to pass in the user object instead of the username in case someone would want to dig into roles/perms.

@@ -2591,6 +2600,15 @@ def sql_json_exec(
# Set tmp_table_name for CTA
if select_as_cta and mydb.force_ctas_schema:
tmp_table_name = f"{mydb.force_ctas_schema}.{tmp_table_name}"
elif select_as_cta:
dest_schema_name = get_cta_schema_name(
schema, sql, g.user.username if g.user else None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

g.user will only be there if you're in sync mode (inside the scope of a web request), won't work on Celery.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

views/core.py is in the scope of the web request, this happens before query object is created and passed to the celery worker - we should be safe here.

@@ -2644,8 +2662,9 @@ def sql_json_exec(
)

# set LIMIT after template processing
limits = [mydb.db_engine_spec.get_limit_from_sql(rendered_query), limit]
query.limit = min(lim for lim in limits if lim is not None)
if not config.get("SQLLAB_CTA_NO_LIMIT", False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can assume that the key exists no need for default to False here, and if would be False anyways if the key doesn't exist

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point :)

@bkyryliuk bkyryliuk force-pushed the bogdan/cta_userschema branch 5 times, most recently from c52c861 to 47c7e38 Compare December 26, 2019 23:00
@bkyryliuk bkyryliuk force-pushed the bogdan/cta_userschema branch 2 times, most recently from 861a68d to 07787c2 Compare February 20, 2020 19:28
Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice refactor! My only remaining issue is with the following scenarios:

  1. user enters a fully qualified name in the table field (highlighted in you TODO comment)
  2. user has chars in CTAS schema/table name that require quotes in fully qualified name.

I think we can live with these restrictions for now, so leaning towards getting this merged soon. Let me test this locally before final approval.


def upgrade():
op.add_column(
"query", sa.Column("tmp_schema_name", sa.String(length=256), nullable=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not to other reviewers) At first I was confused by the choice of column name here, but turns out there was already tmp_table_name for the CTAS target target table. A more appropriate name might be target_table_name or ctas_table_name, but no point in changing the current convention in this PR.

# sqlite doesn't support schemas
return
tmp_table_name = "tmp_async_22"
expected_full_table_name = f"{CTAS_SCHEMA_NAME}.{tmp_table_name}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to the topic of quoting.. select_star in BaseEngineSpec quotes schema and table_name, and I'm wondering if we shouldn't do that here (views/core.py:Superset.sql_json_exec), too. If not, then we just need to assume users won't be doing CTAS into schemas/tables with periods, which seems like a reasonable assumption, for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am open to either, quote while generating SQL or quote early on when getting the user input.
I think both approaches are good, the latter one would be a bit involved as table object creation should be quoted for SQLATable, Column and Metric. This test case just demonstrates the existing behavior, modifying it probably would be out of scope of this change, but definitely a useful improvement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not push too much complexity into this PR, we can deal with this later.

@bkyryliuk bkyryliuk force-pushed the bogdan/cta_userschema branch from 07787c2 to 5117648 Compare February 24, 2020 18:04
Fixing unit tests

Fix table quoting

Mypy

Split tests out for sqlite

Grant more permissions for mysql user

Postgres doesn't support if not exists

More logging

Commit for table creation

Priviliges for postgres

Update tests

Resolve comments

Lint

No limits for the CTA queries if configures
@bkyryliuk bkyryliuk force-pushed the bogdan/cta_userschema branch from 5117648 to 6661e39 Compare February 26, 2020 18:27
Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally and works well. Had missed the type comments before, so would like to get those fixed; beyond that this looks good to go for me.

Comment on lines 551 to 553
SQLLAB_CTA_SCHEMA_NAME_FUNC = (
None
) # type: Optional[Callable[["Database", "models.User", str, str], str]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of a nit, but as Superset has deprecated support for python 2.7, we prefer regular type annotations over type comments, i.e.

SQLLAB_CTA_SCHEMA_NAME_FUNC: Optional[
    Callable[["Database", "models.User", str, str], str]
] = None

Comment on lines 250 to 252
func = config.get(
"SQLLAB_CTA_SCHEMA_NAME_FUNC"
) # type: Optional[Callable[[Database, ab_models.User, str, str], str]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. Also, we prefer using square brackets when reading configs, i.e. config["SQLLAB_CTA_SCHEMA_NAME_FUNC"] to avoid having to check for None (doesn't really apply in this case, but it's a convention nonetheless).

# Set tmp_schema_name for CTA
# TODO(bkyryliuk): consider parsing, splitting tmp_schema_name from tmp_table_name if user enters
# <schema_name>.<table_name>
tmp_schema_name = schema # type: Optional[str]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @mistercrunch @john-bodley @dpgaspar a second opinion would be good here, as the review process has been fairly lengthy with lots of back and forth, possibly resulting in us overlooking something important.

:param overwrite: table_name will be dropped if true
:return: Create table as query
"""
exec_sql = ""
sql = self.stripped()
# TODO(bkyryliuk): quote full_table_name
full_table_name = f"{schema_name}.{table_name}" if schema_name else table_name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn’t we address the TODO? Note the quoter needs to be dialect specific.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@john-bodley this would be an additional feature. I kept the logic as it was before.
It is worth to resolve this todo and it is existing bug in superset, but I think this PR is not a right place for the fix as it is already quite large & hard to review and comprehend.

@@ -64,8 +64,10 @@ jobs:
- redis-server
before_script:
- mysql -u root -e "DROP DATABASE IF EXISTS superset; CREATE DATABASE superset DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci"
- mysql -u root -e "DROP DATABASE IF EXISTS sqllab_test_db; CREATE DATABASE sqllab_test_db DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s not clear from this PR why we need these additional two databases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@john-bodley this is purely for testing, e.g. there is a need to have 2 different schemas in the mysql & postgres to test the CTA behavior

@villebro
Copy link
Member

As this PR includes a db migration, it would be nice to merge soon if there are no objections. I'll leave this open for a few more days, then merging if no further comments surface.

@villebro villebro merged commit 4e1fa95 into apache:master Mar 3, 2020
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.36.0 labels Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels need:more-info Requires more information from author size/L 🚢 0.36.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants