Caching charts in Superset #7340

Closed
betodealmeida opened this issue Apr 21, 2019 · 3 comments
Labels
enhancement:request Enhancement request submitted by anyone from the community inactive Inactive for >= 30 days

Comments

@betodealmeida
Member

Summary

This document describes the strategies for caching responses and invalidating them in Superset. They’re currently implemented in the following PRs:

I’m sharing this since it seems we didn’t have enough time to discuss the changes in the first PR.

Terminology

  • Native cache refers to the cache handled automatically by the browser. It cannot be accessed programmatically, so Cache invalidation #7319 bypasses it using the “no-cache” control, delegating cache management to SupersetClient in Feat: improve caching apache-superset/superset-ui#137.
  • Cache API is a new API that allows programmatic access to the browser cache. It allows responses to be inspected and invalidated from JavaScript. We also call this the client cache.
  • Server cache refers to a new server-side cache introduced in Fetch charts with GET to benefit from browser cache and conditional requests #7032 which saves response objects, keyed by request URL. It’s used to cache GET requests.
  • Dataframe cache refers to a server-side cache which stores dataframes, keyed by the query object. It’s used for both POST and GET requests.
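
To make the two server-side keys concrete, here is a minimal sketch. It is not Superset's actual implementation: the function names and the MD5-over-sorted-JSON scheme are assumptions chosen for illustration. The server cache is keyed by the request URL, while the dataframe cache is keyed by the query object.

```python
import hashlib
import json
from urllib.parse import urlencode


def server_cache_key(path: str, form_data: dict) -> str:
    """Hypothetical server-cache key: the full GET request URL."""
    query = urlencode({"form_data": json.dumps(form_data, sort_keys=True)})
    return path + "?" + query


def dataframe_cache_key(query_obj: dict) -> str:
    """Hypothetical dataframe-cache key: a hash of the query object.

    Sorting the keys makes the hash independent of dict ordering.
    """
    payload = json.dumps(query_obj, sort_keys=True, default=str)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

Two requests with different URLs can therefore miss the server cache and still share one dataframe cache entry, as long as they resolve to the same query object.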

GET vs. POST

  1. When a user visits a chart or a dashboard, a GET request is made (technically this happens when a Chart component is mounted). The GET request has the slice id, the visualization type and, in the case of dashboards, any additional filters that the dashboard might apply.
  • This allows the charts to be cached by the browser, with the same duration as the server cache and the dataframe cache.
    • If the server cache hasn’t been invalidated, this saves a request. For dashboards, this should happen most of the time.
    • If the server cache has been invalidated, this shows stale data (stored in the client cache), forcing the user to refresh the chart/dash in order to get new data.
  • This also allows the browser to perform conditional requests, in case the client cache has expired but the data hasn’t changed. This still performs a request, but if the data hasn’t changed only headers are returned.
  2. When a user clicks “run query” in a chart, or “force refresh” in a dashboard, the chart payload is requested with a POST request.
  • This will invalidate the client cache: all stored responses that reference the chart are invalidated, including charts with extra filters from dashboards. This is an aggressive strategy, erring on the side of caution, since in theory we would only need to do this when a chart is saved or when a dashboard is force refreshed.
  • This will also invalidate the server cache, but only for that specific URL. This means that when a user visits a dashboard with extra filters after modifying a chart, they will see the old chart, requiring a manual refresh.
  3. When a user changes a filter box in a dashboard, GET requests are performed. This makes it almost instantaneous to change between values back and forth in the filter box.
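
The asymmetry between the two invalidation strategies can be sketched as follows. This is a simulation under stated assumptions, not the code from the PRs: both caches are modelled as plain dicts keyed by URL, and `chart_url`, `slice_id_of`, and `on_post` are hypothetical helpers.

```python
import json
from typing import Dict, Optional
from urllib.parse import parse_qs, urlencode, urlparse


def chart_url(slice_id: int, extra_filters: Optional[list] = None) -> str:
    """Build a GET URL carrying the slice id and any dashboard filters."""
    form_data: dict = {"slice_id": slice_id}
    if extra_filters:
        form_data["extra_filters"] = extra_filters
    return "explore_json?" + urlencode({"form_data": json.dumps(form_data)})


def slice_id_of(url: str) -> int:
    """Recover the slice id from the form_data query parameter."""
    form_data = json.loads(parse_qs(urlparse(url).query)["form_data"][0])
    return form_data["slice_id"]


def on_post(url: str, client_cache: Dict[str, str], server_cache: Dict[str, str]) -> None:
    """Apply both invalidation rules for a POST to `url`."""
    slice_id = slice_id_of(url)
    # Client cache: drop every stored response that references the chart.
    for cached_url in list(client_cache):
        if slice_id_of(cached_url) == slice_id:
            del client_cache[cached_url]
    # Server cache: drop only the entry for this exact URL.
    server_cache.pop(url, None)
```

After `on_post(chart_url(1), ...)`, the client cache holds no entry for slice 1, but the server cache still holds the entry for the same slice with dashboard filters, which is the stale response described above.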

Scenarios

Single user in Explore view

  1. User creates a new chart
  • Browser does POST request to explore_json with form data
  • Server caches the dataframe
  2. User clicks “run query” a few times, modifying the form data until they’re happy
  • Browser does POST requests to explore_json for each click
  • Server checks dataframe cache
    • If the query object hasn’t changed the cache is reused
    • Otherwise the query is run, and the dataframe is cached
  3. User saves chart (id=1)
  • Browser reloads page
  • Browser does GET request to explore_json?form_data={"slice_id":1}
  • Response has ETag and Expires headers, browser caches it using the Cache API
  • Server caches the response
  4. User visits the chart before the cache expiration
  • Browser uses Cache API and determines cached response is valid
  • Browser reuses cached response
  • No HTTP request is made
  5. User visits the chart after the cache expiration
  • Browser uses Cache API and finds an expired cached response
  • Browser does conditional request, sending the hash of the cached response
    • If the server returns a 304, the cached response is used, and no body is transferred, only headers
    • If the server returns a 200, the response is cached and used
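
Steps 4 and 5 can be sketched as a small simulation. Everything here is an assumption made for illustration: the client cache is a dict, the ETag is an MD5 hash of the body, time is a plain number of seconds, and a 304 is taken to extend the cached entry's lifetime. The real client uses the browser Cache API through SupersetClient.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class CachedResponse:
    body: bytes
    etag: str
    expires_at: float  # derived from the Expires header


def make_etag(body: bytes) -> str:
    return '"' + hashlib.md5(body).hexdigest() + '"'


def server_get(body: bytes, if_none_match: Optional[str]) -> Tuple[int, str, bytes]:
    """Answer 304 with an empty body when the client's hash still matches."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, etag, b""
    return 200, etag, body


def client_get(
    cache: Dict[str, CachedResponse],
    url: str,
    now: float,
    ttl: float,
    fetch: Callable[[Optional[str]], Tuple[int, str, bytes]],
) -> Tuple[bytes, str]:
    """Return (body, source), where source tells how the body was obtained."""
    entry = cache.get(url)
    if entry is not None and now < entry.expires_at:
        return entry.body, "cache"  # still valid: no HTTP request
    status, etag, body = fetch(entry.etag if entry is not None else None)
    if status == 304 and entry is not None:
        entry.expires_at = now + ttl  # only headers were transferred
        return entry.body, "revalidated"
    cache[url] = CachedResponse(body, etag, now + ttl)
    return body, "network"
```

The `source` value is only there to make the three paths (valid cache, 304, 200) visible when experimenting.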

Single user Explore/Dashboard interaction

  1. User creates a new chart, saves it
  2. User creates dashboard with chart
  3. User visits the dashboard before cache expiration
  • Browser uses Cache API and determines cached response is valid
  • Browser reuses cached response
  • No HTTP request is made
  4. User clicks “force refresh”, either in chart or dashboard
  • Browser does POST request
  • Browser invalidates all cached responses that reference the chart
  • Server invalidates cached responses associated with the dashboard (key is based on slice id and extra filters)
  • Fresh data is served
  5. User visits dashboard again
  • Browser cache is empty from previous step
  • Browser does GET requests
  • Server side cache is empty from previous step
  • Dataframe cache is empty from previous step
  • Server computes results and stores them in the dataframe cache and server cache
  • Browser caches responses
  6. User visits dashboard once more before the cache expiration
  • Responses in the client cache are reused
  • No HTTP requests are made
  7. User visits dashboard after cache expiration
  • Conditional requests are done, trying to reuse client cache

Multiple user Explore/Dashboard interaction

  1. User A visits dashboard with extra filters every day
  • If client cache is not expired, cached response is reused
  • Otherwise browser does conditional GET requests
    • Requests might hit server cache or dataframe cache, depending on the timing
  2. User B modifies a chart in the dashboard, using the explore view
  • The client cache is invalidated only in B’s browser
  • The server cache is invalidated only for the slice id (but not cached responses that have the extra filters)
  • The dataframe cache is invalidated
  3. User A visits the dashboard again, before cache expiration
  • The client cache wasn’t invalidated, so the response is reused, showing stale data
  • The user has to click “force refresh” in order to see the new data

Note that we can overcome the staleness problem described in (3) by making the dashboard component force refresh slices that were changed after the response was cached in the browser. The payload received by the dashboard has a changed_on attribute for each slice, and the responses in the client cache have the timestamp when the requests were made. @graceguo-supercat, I think this addresses your concern?
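
A minimal sketch of that check, assuming the dashboard payload is a list of dicts with `slice_id` and `changed_on`, and that the client keeps a map from slice id to the time its response was cached. Both shapes and the function name are hypothetical.

```python
from datetime import datetime
from typing import Dict, List


def slices_to_force_refresh(
    slices: List[dict], cached_at: Dict[int, datetime]
) -> List[int]:
    """Return ids of slices changed after their response was cached."""
    stale = []
    for s in slices:
        cached_time = cached_at.get(s["slice_id"])
        # Slices with no cached response are fetched normally, so skip them.
        if cached_time is not None and s["changed_on"] > cached_time:
            stale.append(s["slice_id"])
    return stale
```

In the multiple-user scenario, user B's save bumps `changed_on` for the chart, so user A's dashboard would refresh just that slice instead of showing the stale cached response.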

Performance

I measured dashboard loading times before and after the GET requests were introduced, using the top 10 dashboards at Lyft. The average improvement was about 21%, and the biggest improvement observed was about 62% (Dashboard 3).

Dashboard 1

  • Loading times (seconds):
    • Before: 2.32, 2.3, 2.1, 2.23, 2.52, 2.33, 3.09, 2.31, 2.24, 2.3
    • After: 2.06, 1.97, 2.76, 2.48, 2.31, 2.55, 1.93, 2.17, 2.17, 2.27
  • Improvement: 3.07%

Dashboard 2

  • Loading times (seconds):
    • Before: 2.47, 2.56, 2.68, 2.48, 2.64, 2.71, 2.55, 2.49, 2.79, 2.62
    • After: 1.68, 1.57, 1.45, 1.42, 1.44, 1.52, 1.52, 1.51, 1.67, 1.68
  • Improvement: 41.12%

Dashboard 3

  • Loading times (seconds):
    • Before: 40.19, 23.41, 1.96, 1.91, 1.96, 2.44, 2.5, 2.01, 2.18, 1.85
    • After: 2.41, 1.65, 1.72, 1.99, 1.66, 1.92, 1.84, 1.73, 1.69, 1.89
  • Improvement: 62.37%

Dashboard 4

  • Loading times (seconds):
    • Before: 8.97, 1.72, 1.69, 1.61, 1.47, 1.75, 1.54, 1.71, 1.42, 5.83
    • After: 1.79, 1.46, 1.62, 1.46, 1.62, 1.75, 1.55, 1.45, 1.74, 1.61
  • Improvement: 26.04%

Dashboard 5

  • Loading times (seconds):
    • Before: 1.84, 1.99, 2.17, 1.89, 1.88, 1.87, 1.86, 2.07, 2.08, 1.77
    • After: 2.02, 2.07, 1.9, 1.95, 1.71, 2.09, 1.94, 2.03, 1.83, 1.91
  • Improvement: -1.10%

Dashboard 6

  • Loading times (seconds):
    • Before: 45.31, 6.21, 6.11, 6.57, 5.49, 5.66, 5.94, 6.86, 6.67, 6.79
    • After: 7.7, 6.51, 5.12, 5.03, 5.24, 5.8, 3.81, 4.0, 5.17, 5.1
  • Improvement: 17.40%

Dashboard 7

  • Loading times (seconds):
    • Before: 5.1, 4.71, 5.02, 5.07, 4.82, 5.17, 4.36, 5.15, 4.79, 4.83
    • After: 1.86, 1.98, 2.03, 2.05, 2.16, 2.05, 2.14, 2.01, 2.01, 2.21
  • Improvement: 58.39%

Dashboard 8

  • Loading times (seconds):
    • Before: 1.76, 1.83, 1.95, 1.72, 1.63, 1.72, 2.09, 1.84, 1.78, 1.77
    • After: 1.93, 1.57, 1.55, 1.59, 1.91, 1.59, 1.7, 1.57, 1.61, 1.79
  • Improvement: 7.24%

Dashboard 9

  • Loading times (seconds):
    • Before: 4.41, 2.8, 1.89, 2.1, 1.84, 2.01, 1.85, 1.89, 1.85, 1.99
    • After: 1.77, 2.78, 1.77, 2.3, 2.01, 1.88, 2.18, 2.36, 2.0, 2.22
  • Improvement: -4.31%

Dashboard 10

  • Loading times (seconds):
    • Before: 1.74, 1.49, 1.93, 1.53, 1.46, 1.47, 1.6, 1.53, 1.47, 1.39
    • After: 2.73, 1.56, 1.41, 1.52, 1.5, 1.33, 1.38, 1.35, 1.51, 1.59
  • Improvement: 3.82%

Note: improvement was computed by dropping the highest/lowest values and taking the average.
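
The computation in the note can be sketched as below. Dropping every occurrence of the extreme values (rather than one of each) is an assumption; it is the reading that reproduces the figures reported for Dashboards 1, 2 and 7. The function names are illustrative.

```python
from typing import List


def trimmed_mean(samples: List[float]) -> float:
    """Average after dropping all samples equal to the min or the max."""
    lo, hi = min(samples), max(samples)
    kept = [s for s in samples if s not in (lo, hi)]
    return sum(kept) / len(kept)


def improvement(before: List[float], after: List[float]) -> float:
    """Percentage reduction in trimmed-mean loading time."""
    return 100 * (1 - trimmed_mean(after) / trimmed_mean(before))


# Dashboard 7 from the measurements above.
before = [5.1, 4.71, 5.02, 5.07, 4.82, 5.17, 4.36, 5.15, 4.79, 4.83]
after = [1.86, 1.98, 2.03, 2.05, 2.16, 2.05, 2.14, 2.01, 2.01, 2.21]
print(f"{improvement(before, after):.2f}%")  # 58.39%
```

Note that `trimmed_mean` divides by zero if every sample is identical, which does not occur in the data above.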

@issue-label-bot

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.65. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

@graceguo-supercat

graceguo-supercat commented Apr 23, 2019

When designing a simple cache system, I'd like to see this list of things:

  • cache key
  • cache age
  • cache size
  • eviction policy
  • cache performance, like hit rate, cache misses, etc
  • and more...

The dataframe cache in Superset is pretty stable. It uses the query object as key, and cache age is configurable via CACHE_DEFAULT_TIMEOUT. Cached content size varies from a single row up to 50K rows, depending on the user's data. If the user configures Superset to use Redis as the caching backend, cache size is not a concern for now. We don't have to offer a smart eviction algorithm (Redis or another backend has its own implementation, probably LRU).
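
For reference, a Redis-backed setup along these lines might look like the sketch below in superset_config.py. The timeout value, key prefix, and Redis URL are placeholder assumptions; the dictionary keys are the ones Flask-Caching reads.

```python
# superset_config.py (illustrative values only)
CACHE_DEFAULT_TIMEOUT = 60 * 60 * 24  # seconds; one day is an arbitrary example

CACHE_CONFIG = {
    "CACHE_TYPE": "redis",
    "CACHE_DEFAULT_TIMEOUT": CACHE_DEFAULT_TIMEOUT,
    "CACHE_KEY_PREFIX": "superset_",  # placeholder prefix
    "CACHE_REDIS_URL": "redis://localhost:6379/0",  # placeholder URL
}
```

Eviction is then governed by the Redis server's own maxmemory-policy setting rather than by anything in Superset.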

In #7032 and apache-superset/superset-ui#119, you introduced 2 layers of caching: client-side cache (conditional requests) and server cache. Let's see how it works (as of today in master branch):
(The following is just what I learned by reading the code; please correct me if I got it wrong.)

client-side cache and conditional requests:

  • cache key: request url
  • cache age: by Expires header, configurable by CACHE_DEFAULT_TIMEOUT
  • cache size: by browser
  • eviction policy: no eviction policy implemented

server-side cache:

  • cache key: request parameters (like full form_data)
  • cache age: configurable by CACHE_DEFAULT_TIMEOUT
  • cache size: by caching backend
  • eviction policy: by caching backend

First, using the request URL as the client-side cache key is problematic. For example, user A changes slice_1 by adding a filter and saves it. User B then won't see this change until his browser cache expires. I think #7255 is trying to resolve this issue, but it cannot resolve the root cause. In #7255 the issue is exposed by a user changing viz_type, so the fix is to add the viz_type parameter to the client-side cache key. But what if other chart parameters get updated and saved with the same chart id? viz_type is just one of many saved chart parameters, so I don't think it's necessary to make it part of the client-side cache key.

Furthermore, what if the browser cache is full? My concern is whether this client-side cache can scale to many large dashboards.

There are a few issues in switching between GET and POST methods (I think you probably fixed them in #7319 by invalidating the client cache?):

  • updating a dashboard filter triggers POST requests, and the results are not cached on the client side
  • force refreshing a chart triggers a POST request, and the client-side cache is not updated with the refreshed data.

Last but not least, I have some concerns on security:

  • Superset uses the POST method for queries, which gets CSRF protection from FAB. With the GET method we no longer have this protection. You can ask why we need this layer of protection, but Airbnb internally has already requested that we DISABLE the GET method for query requests.
  • The GET method may expose all query parameters in the request URL, which may include sensitive PII and get logged by monitoring systems. We had a similar issue in the query search page, Adding permission for can_only_access_owned_queries #7234, where we exposed query parameters. Exposing sensitive data in logs may not be as apparent as showing it in a web page, but it could cause more damage, because logs may be kept for a very long time and may contain a lot of data.
  • Generally speaking, storing query results (which may contain PII) on the client side is bad practice. It may not be as dangerous as storing a user's credit card or other credentials, but localStorage or the browser cache is still vulnerable to XSS attacks, and the data can be read by scripts on the page.

@kristw kristw added enhancement:request Enhancement request submitted by anyone from the community and removed feature_request labels Apr 30, 2019
@stale

stale bot commented Jun 29, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

@stale stale bot added the inactive Inactive for >= 30 days label Jun 29, 2019
@stale stale bot closed this as completed Jul 6, 2019