Conversation

@jvwong (Member) commented Mar 15, 2022

I tried two versions of caching the /api/document GET to speed up the homepage carousel:

  1. (not shown) Cache the latest doc IDs and do a db get on them for subsequent requests: this wasn't much faster than the existing version (roughly 10%; 2.6 s vs 2.3 s), keeping in mind we're already indexing by created date.

  2. This PR: cache the raw JSON. This one is fast (2.8 s vs 0.15 s), but of course misses data updates.

Resetting the cache is tricky because it won't catch direct actions on the doc model status via the admin.

Other options: reduce the number of items in the carousel or introduce a search/browse page; server-side render; leave it as is.

Refs #909.

@maxkfranz (Member) commented Mar 15, 2022 via email

@jvwong (Member Author) commented Mar 15, 2022

> Another simple option is expiry: the cache can live for at most a given period (T), e.g. one day, before it's cleared/rebuilt. That way, it's never more than T stale.

Makes sense. I guess the important case is that an author who submits sees their paper on the homepage. Messing around with the status of older docs isn't such a big deal.
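
For concreteness, a minimal sketch of that expiry scheme, assuming a hypothetical `getDocsJson()` that performs the real query; none of these names are the PR's actual code:

```js
const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // T = one day

let cached = null;
let cachedAt = 0;

const getCachedDocs = async () => {
  const expired = Date.now() - cachedAt > CACHE_TTL_MS;

  if (cached == null || expired) {
    cached = await getDocsJson(); // rebuild the raw JSON payload
    cachedAt = Date.now();
  }

  return cached;
};
```

An author's fresh submission still wouldn't appear until the TTL elapses, so the submit route could also clear `cached` directly.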


```js
// Hunk under review; DOCUMENT_IMAGE_CACHE_SIZE is presumably a config constant.
const docCache = new LRUCache({
  max: DOCUMENT_IMAGE_CACHE_SIZE
});
```
Member

Are we caching more than one key/result? Why use an LRU cache in particular? If we don't need an LRU cache specifically, you could use a general cache with TTL expiry already built in, e.g. https://www.npmjs.com/package/node-cache

Are you thinking about making this more granular to a per-ID cache? If not, could this just be a variable that can be null/non-null?
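
A minimal sketch of that suggestion with node-cache, where the key name and TTL are illustrative:

```js
const NodeCache = require('node-cache');

// stdTTL is in seconds; entries expire automatically, after one day here.
const docCache = new NodeCache({ stdTTL: 24 * 60 * 60 });

docCache.set('docs', docsJson);   // cache the raw JSON payload
const hit = docCache.get('docs'); // undefined once the TTL has elapsed
```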

Member Author

You're right, I didn't need this, only the TTL. Lemme look at that other package.

Member Author

By the way, lru-cache is way outta date (we're on 4.4.1 but latest is 7.*). Should we update?

Member

Unless we have a strong reason to update to a new major version, it's usually not worth the time it takes to verify that the update hasn't broken things. Upgrading from v4 to v7 means the API probably changed quite a bit.

Member

To clarify the first comment a bit: If we want a TTL etc., then a lib like node-cache is good. If we don't want a TTL, then we can just use a simple variable that can be nulled out.
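
A minimal sketch of the no-TTL alternative, a single nullable variable; `getDocsJson()` is again hypothetical:

```js
let cachedDocs = null;

const getDocs = async () => {
  if (cachedDocs == null) {
    cachedDocs = await getDocsJson();
  }
  return cachedDocs;
};

// Call from any write path that should invalidate the cache.
const clearDocCache = () => { cachedDocs = null; };
```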

@maxkfranz (Member)

Would this also help to speed up the search feature? That's currently just caching all the docs in the browser's memory with that search lib, right?

@jvwong (Member Author) commented Mar 15, 2022

> Would this also help to speed up the search feature? That's currently just caching all the docs in the browser's memory with that search lib, right?

Yes, this was the motivation for taking on this issue. The only technical wrinkle is setting the number of cached docs, i.e. the default limit for the homepage case vs the search case.

@maxkfranz (Member)

I'm assuming the home page has something like n=20, whereas the search has n=infinity?

How big would the infinity case be in KB?

@jvwong (Member Author) commented Mar 15, 2022

> I'm assuming the home page has something like n=20, whereas the search has n=infinity?

(This PR) GET /api/document returns n=20 by default. The document-search (unstable) GET asks explicitly for 100.

> How big would the infinity case be in KB?

| Status | N (docs) | Content-Length (KB) | Elapsed (s) |
| --- | --- | --- | --- |
| public | 20 | 517 | 2.5 |
| public | 100 | 2,758 | 10 |
| public | 121 | 3,279 | 12.1 |

@maxkfranz (Member)

What do you think about the search including all docs by default instead of 100? If I were to do a search and an expected, existing doc didn't come up because of the limit, I wouldn't enjoy using the search.

In that case, we could use the cache keys so that the carousel isn't negatively affected by the search, and both would be cached.

  • key = '20' => used in carousel
  • key = 'infinity' => used in search
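
A minimal sketch of that keying scheme; `getDocsJson()` and the parameter shape are assumptions, not this PR's actual code:

```js
const getDocsByLimit = async limit => {
  const key = String(limit); // '20' for the carousel, 'Infinity' for the search

  let docs = docCache.get(key);

  if (docs == null) {
    docs = await getDocsJson({ limit });
    docCache.set(key, docs);
  }

  return docs;
};
```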

Notes:

  • The content-length is the uncompressed size, and it would be much smaller gzipped over the wire.
  • Eventually, we'll need server-side search (maybe n=1000).
  • We could also trim down the docs a bit for search so unused fields don't take up space. We may not need this now, but it could be an interim solution before using server-side search.

@jvwong (Member Author) commented Mar 16, 2022

> What do you think about the search including all docs by default instead of 100? If I were to do a search and an expected, existing doc didn't come up because of the limit, I wouldn't enjoy using the search.

Worth a try!

> Notes:
> * We could also trim down the docs a bit for search so unused fields don't take up space. We may not need this now, but it could be an interim solution before using server-side search.

I noticed the entire article is in the doc JSON along with the citation. We could easily lose the former, which probably takes up most of the JSON and is frankly available on PubMed anyway.
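
A minimal sketch of the trimming, assuming the serialized doc is a plain object with an `article` property:

```js
// Keep the citation, drop the full article before caching/serving for search.
const toSearchJson = doc => {
  const { article, ...trimmed } = doc;
  return trimmed;
};

const searchDocs = docs.map(toSearchJson);
```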

@jvwong (Member Author) commented Mar 16, 2022

Removing the article from the doc JSON helps quite a bit. I am pretty confident nobody is using this information in the doc JSON. Will test around.

| Status | N (docs) | Content-Length (KB) | Elapsed (s) |
| --- | --- | --- | --- |
| public | 20 | 243 | 2.7 |
| public | 100 | 1,148 | 9.5 |
| public | 121 | 1,329 | 11.9 |

@maxkfranz (Member)

The infinity, all-docs case is probably less than 1 MB when gzipped. That's about the size of an image, so it would be reasonable to load them all for the search, for now anyway.

- Cache instances of only limit set to 'Infinity'
@jvwong (Member Author) commented Mar 17, 2022

This is on https://test.biofactoid.org

@maxkfranz (Member)

Let's just bump up the max size of the image cache so all the images should be cached in memory. That should resolve the random image reloading in the search for now.

@jvwong (Member Author) commented Mar 17, 2022

Possible improvements:

The home component makes two closely staggered web service calls to retrieve documents: first, one to retrieve all public docs for the search (longer running), then one for the carousel to retrieve a recent slice. It seems that when the cache is empty, the latter is bottlenecked by the former. This has the effect of making the carousel load slowly for the sake of the search.

Will move to separate issue.
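
For reference, a minimal sketch of decoupling the two calls so the carousel isn't awaited behind the search; the endpoints, params, and render functions here are illustrative:

```js
const carouselReq = fetch('/api/document?limit=20').then(res => res.json());
const searchReq = fetch('/api/document?limit=Infinity').then(res => res.json());

// Render the carousel as soon as its small slice arrives; let the search
// payload fill in whenever it lands.
carouselReq.then(renderCarousel);
searchReq.then(initSearchIndex);
```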

@maxkfranz (Member)

See also: https://rethinkdb.com/docs/secondary-indexes/javascript/

You have to explicitly use the getAll command to take advantage of secondary indexes.

If we don't use the index, it will be slow.
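
A minimal sketch of a getAll query against a secondary index; the table, index, and field names here are assumptions about the schema, not taken from this PR:

```js
const r = require('rethinkdb');

const conn = await r.connect({ host: 'localhost', port: 28015 });

const cursor = await r.table('document')
  .getAll('public', { index: 'status' }) // served by the secondary index
  .orderBy(r.desc('createdDate'))        // in-memory sort of the indexed subset
  .limit(20)
  .run(conn);

const docs = await cursor.toArray();
```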

@jvwong (Member Author) commented Mar 18, 2022

Last commit managed to squeeze a bit more efficiency out of the db:

| Status | N (docs) | Content-Length (KB) | Elapsed (s) |
| --- | --- | --- | --- |
| public | 20 | 237 | 1.34 |
| public | 100 | 1,193 | 3.76 |
| public | 122 ('Infinity') | 1,335 | 4.52 |

@jvwong (Member Author) commented Mar 18, 2022

//TODO - Look at what's going on with the document pathway PNG caching (limit, speed)....

@jvwong (Member Author) commented Mar 21, 2022

> Let's just bump up the max size of the image cache so all the images should be cached in memory. That should resolve the random image reloading in the search for now.

This helped a lot. The likely reason is that, with the limited cache size, the search would effectively punt images that would otherwise show up on the homepage. So now everything is preloaded.

I think this version works decently.

@maxkfranz (Member)

> Let's just bump up the max size of the image cache so all the images should be cached in memory. That should resolve the random image reloading in the search for now.
>
> This helped a lot. The likely reason is that, with the limited cache size, the search would effectively punt images that would otherwise show up on the homepage. So now everything is preloaded.
>
> I think this version works decently.

Nice. There may have been some cache thrashing going on.

@jvwong merged commit 34de7d8 into unstable on Mar 23, 2022
@jvwong deleted the iss909_cache-recent branch on March 23, 2022 at 19:49