-
Notifications
You must be signed in to change notification settings - Fork 13.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: rename TABLE_NAMES_CACHE_CONFIG to DATA_CACHE_CONFIG #11509
Conversation
Codecov Report
@@ Coverage Diff @@
## master #11509 +/- ##
==========================================
+ Coverage 62.86% 67.12% +4.26%
==========================================
Files 889 889
Lines 43055 43062 +7
Branches 4017 4017
==========================================
+ Hits 27065 28905 +1840
+ Misses 15811 14060 -1751
+ Partials 179 97 -82
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
from superset.common.query_object import QueryObject | ||
from superset.connectors.base.models import BaseDatasource | ||
from superset.connectors.connector_registry import ConnectorRegistry | ||
from superset.exceptions import QueryObjectValidationError | ||
from superset.extensions import cache_manager, security_manager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import the cache_manager
instead of cache
to make it clear which cache we are using.
@@ -285,7 +286,11 @@ def get_df_payload( # pylint: disable=too-many-statements | |||
status = utils.QueryStatus.FAILED | |||
stacktrace = utils.get_stacktrace() | |||
|
|||
if is_loaded and cache_key and cache and status != utils.QueryStatus.FAILED: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and cache
is not needed because of #9020
@@ -450,8 +451,8 @@ def inspector(self) -> Inspector: | |||
return sqla.inspect(engine) | |||
|
|||
@cache_util.memoized_func( | |||
key=lambda *args, **kwargs: "db:{}:schema:None:table_list", | |||
attribute_in_key="id", | |||
key=lambda self, *args, **kwargs: f"db:{self.id}:schema:None:table_list", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A small refactor to make the memoize function more flexible.
|
||
def etag_cache( | ||
max_age: int, check_perms: Callable[..., Any], cache: Cache = cache_manager.cache | ||
) -> Callable[..., Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved over from superset.utils.decorators
superset/utils/cache_manager.py
Outdated
self._thumbnail_cache = Cache() | ||
|
||
def init_app(self, app: Flask) -> None: | ||
self._cache.init_app(app, app.config["CACHE_CONFIG"]) | ||
self._tables_cache.init_app(app, app.config["TABLE_NAMES_CACHE_CONFIG"]) | ||
self._data_cache.init_app(app, app.config["DATA_CACHE_CONFIG"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't add backward compatibility because the data objects put into this cache will become much much larger. If a Superset admin has configured in-memory cache for TABLE_NAMES_CACHE_CONFIG
, they might run into problems. Therefore it's better to make things break to make them aware of this change.
For most users, the migration is as simple as renaming the config value.
I will be on PTO next week, won't have a change to give a full review. 1 suggestion it would be nice to add a unit test that will document the expected behavior and will prevent from the regressions, tests can impersonate the user to hit various endpoints. |
I added two test assertions in cache_tests.py. Do you think we need more? |
45ce586
to
c0eb415
Compare
self._cache.init_app( | ||
app, | ||
{ | ||
"CACHE_DEFAULT_TIMEOUT": app.config["CACHE_DEFAULT_TIMEOUT"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure the global default is applied as documented.
93729d6
to
9543b47
Compare
c2a46af
to
ca25463
Compare
@john-bodley @etr2460 So previously the test Not sure why this cache would fail |
Here's a test run that fails when the only change comparing to |
@@ -95,7 +95,7 @@ def test_get_items(self): | |||
"function_names", | |||
"id", | |||
] | |||
self.assertEqual(response["count"], 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
count = 2
seems to assume some tests run before this test. If you just initialized the database and run this test alone, this test will fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please resolve conflicts. otherwise LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, however I didn't quite see how the "timeout search path" happens, is that something happening outside this PR in local configs or implicitly via Flask-Caching
?
if cache_manager.tables_cache: | ||
|
||
def wrapped_f(self: Any, *args: Any, **kwargs: Any) -> Any: | ||
if not kwargs.get("cache", True): | ||
return f(self, *args, **kwargs) | ||
|
||
if attribute_in_key: | ||
cache_key = key(*args, **kwargs).format( | ||
getattr(self, attribute_in_key) | ||
) | ||
else: | ||
cache_key = key(*args, **kwargs) | ||
o = cache_manager.tables_cache.get(cache_key) | ||
if not kwargs.get("force") and o is not None: | ||
return o | ||
o = f(self, *args, **kwargs) | ||
cache_manager.tables_cache.set( | ||
cache_key, o, timeout=kwargs.get("cache_timeout") | ||
) | ||
return o | ||
|
||
else: | ||
# noop | ||
def wrapped_f(self: Any, *args: Any, **kwargs: Any) -> Any: | ||
def wrapped_f(self: Any, *args: Any, **kwargs: Any) -> Any: | ||
if not kwargs.get("cache", True): | ||
return f(self, *args, **kwargs) | ||
|
||
cache_key = key(self, *args, **kwargs) | ||
obj = cache.get(cache_key) | ||
if not kwargs.get("force") and obj is not None: | ||
return obj | ||
obj = f(self, *args, **kwargs) | ||
cache.set(cache_key, obj, timeout=kwargs.get("cache_timeout")) | ||
return obj | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
I think it is referring to this logic here: https://github.com/apache/incubator-superset/blob/600a6fa92a0bbe5bfd93371db5ced6af556c3697/superset/viz.py#L423-L433 |
Ah sorry, I misinterpreted the comment, never mind. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good layer of optional additional security, thanks for implementing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great
0811d14
to
60283e2
Compare
The corresponding cache will now also cache the query results.
SUMMARY
💥 Breaking Change
Rename config value
TABLE_NAMES_CACHE_CONFIG
toDATA_CACHE_CONFIG
, and save datasource query results to this cache, too.Motivation
Companies like Airbnb have different security levels for Superset itself and the datasources Superset connects to: anyone can login to the same Superset, but not everyone have access to all data. Superset has
SecurityManager
to check whether a user has access to certain datasource, but all query results are still stored in the same cache, regardless of users' access levels. If a malicious party obtained access to the cache server via certain Superset user in a high-risk region, they get access to all cached data.By separating cache storage for Superset metadata and the actual data users consume, we can have a better insulation of sensitive data by configuring different cache backend for high-risk regions while keeping the cache for all the Superset metadata in sync.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
N/A
TEST PLAN
CI
ADDITIONAL INFORMATION