Conversation

@jvwong (Member) commented Mar 15, 2022

I tried two versions of caching the /api/document GET to speed up the homepage carousel:

  1. (not shown) Cache the latest doc IDs and do a db get on them for subsequent requests: this wasn't much faster than the existing version (roughly 10%; 2.6 s vs 2.3 s), keeping in mind we're already indexing by created date.

  2. This PR: cache the raw JSON. This one is fast (2.8 s vs 0.15 s), but of course misses data updates.

Resetting the cache is tricky because it won't catch direct actions on the doc model status via the admin.

Other options: reduce the number of items in the carousel or introduce a search/browse page; server-side render; leave it as is.

Refs #909.

@maxkfranz (Member) commented Mar 15, 2022 via email

@jvwong (Member Author) commented Mar 15, 2022

> Another simple option is expiry: the cache can live for at most a given period (T), e.g. one day, before it's cleared/rebuilt. That way, it's never more than T stale.

Makes sense. I guess the important case is that an author who submits sees their paper on the homepage. Messing around with the status of older docs isn't such a big deal.
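
For concreteness, a minimal sketch of that expiry scheme, assuming a hypothetical `getDocsJson()` that performs the real query; none of these names are the PR's actual code:

```js
const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // T = one day

let cached = null;
let cachedAt = 0;

const getCachedDocs = async () => {
  const expired = Date.now() - cachedAt > CACHE_TTL_MS;

  if (cached == null || expired) {
    cached = await getDocsJson(); // rebuild the raw JSON payload
    cachedAt = Date.now();
  }

  return cached;
};
```

An author's fresh submission still wouldn't appear until the TTL elapses, so the submit route could also clear `cached` directly.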


```js
// Hunk under review; DOCUMENT_IMAGE_CACHE_SIZE is presumably a config constant.
const docCache = new LRUCache({
  max: DOCUMENT_IMAGE_CACHE_SIZE
});
```
Member

Are we caching more than one key/result? Why use an LRU cache in particular? If we don't need an LRU cache specifically, you could use a general cache with TTL expiry already built in, e.g. https://www.npmjs.com/package/node-cache

Are you thinking about making this more granular to a per-ID cache? If not, could this just be a variable that can be null/non-null?
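
A minimal sketch of that suggestion with node-cache, where the key name and TTL are illustrative:

```js
const NodeCache = require('node-cache');

// stdTTL is in seconds; entries expire automatically, after one day here.
const docCache = new NodeCache({ stdTTL: 24 * 60 * 60 });

docCache.set('docs', docsJson);   // cache the raw JSON payload
const hit = docCache.get('docs'); // undefined once the TTL has elapsed
```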

Member Author

You're right, I didn't need this, only the TTL. Lemme look at that other package.

Member Author

By the way, lru-cache is way outta date (we're on 4.4.1 but latest is 7.*). Should we update?

Member

Unless we have a strong reason to update to a new major version, it's usually not worth the time it takes to verify that the update hasn't broken things. Upgrading from v4 to v7 means the API probably changed quite a bit.

Member

To clarify the first comment a bit: If we want a TTL etc., then a lib like node-cache is good. If we don't want a TTL, then we can just use a simple variable that can be nulled out.
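
A minimal sketch of the no-TTL alternative, a single nullable variable; `getDocsJson()` is again hypothetical:

```js
let cachedDocs = null;

const getDocs = async () => {
  if (cachedDocs == null) {
    cachedDocs = await getDocsJson();
  }
  return cachedDocs;
};

// Call from any write path that should invalidate the cache.
const clearDocCache = () => { cachedDocs = null; };
```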

@maxkfranz (Member)

Would this also help to speed up the search feature? That's currently just caching all the docs in the browser's memory with that search lib, right?

@jvwong (Member Author) commented Mar 15, 2022

> Would this also help to speed up the search feature? That's currently just caching all the docs in the browser's memory with that search lib, right?

Yes, this was the motivation for taking on this issue. The only technical wrinkle is setting the number of cached docs, i.e. the default limit for the homepage case vs the search case.

@maxkfranz (Member)

I'm assuming the home page has something like n=20, whereas the search has n=infinity?

How big would the infinity case be in KB?

@jvwong (Member Author) commented Mar 15, 2022

> I'm assuming the home page has something like n=20, whereas the search has n=infinity?

(This PR) GET /api/document returns n=20 by default. The document-search (unstable) GET asks explicitly for 100.

> How big would the infinity case be in KB?

| Status | N (docs) | Content-Length (KB) | Elapsed (s) |
| --- | --- | --- | --- |
| public | 20 | 517 | 2.5 |
| public | 100 | 2,758 | 10 |
| public | 121 | 3,279 | 12.1 |

@maxkfranz (Member)

What do you think about the search including all docs by default instead of 100? If I were to do a search and an expected, existing doc didn't come up because of the limit, I wouldn't enjoy using the search.

In that case, we could use the cache keys so that the carousel isn't negatively affected by the search, and both would be cached.

  • key = '20' => used in carousel
  • key = 'infinity' => used in search
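
A minimal sketch of that keying scheme; `getDocsJson()` and the parameter shape are assumptions, not this PR's actual code:

```js
const getDocsByLimit = async limit => {
  const key = String(limit); // '20' for the carousel, 'Infinity' for the search

  let docs = docCache.get(key);

  if (docs == null) {
    docs = await getDocsJson({ limit });
    docCache.set(key, docs);
  }

  return docs;
};
```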

Notes:

  • The content-length is the uncompressed size, and it would be much smaller gzipped over the wire.
  • Eventually, we'll need server-side search (maybe n=1000).
  • We could also trim down the docs a bit for search so unused fields don't take up space. We may not need this now, but it could be an interim solution before using server-side search.

@jvwong (Member Author) commented Mar 16, 2022

> What do you think about the search including all docs by default instead of 100? If I were to do a search and an expected, existing doc didn't come up because of the limit, I wouldn't enjoy using the search.

Worth a try!

> Notes:
> * We could also trim down the docs a bit for search so unused fields don't take up space. We may not need this now, but it could be an interim solution before using server-side search.

I noticed the entire article is in the doc JSON along with the citation. We could easily lose the former, which probably takes up most of the JSON and is frankly available on PubMed anyway.
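
A minimal sketch of the trimming, assuming the serialized doc is a plain object with an `article` property:

```js
// Keep the citation, drop the full article before caching/serving for search.
const toSearchJson = doc => {
  const { article, ...trimmed } = doc;
  return trimmed;
};

const searchDocs = docs.map(toSearchJson);
```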

@jvwong (Member Author) commented Mar 16, 2022

Removing the article from the doc JSON helps quite a bit. I am pretty confident nobody is using this information in the doc JSON. Will test around.

| Status | N (docs) | Content-Length (KB) | Elapsed (s) |
| --- | --- | --- | --- |
| public | 20 | 243 | 2.7 |
| public | 100 | 1,148 | 9.5 |
| public | 121 | 1,329 | 11.9 |

@maxkfranz (Member)

The infinity, all-docs case is probably less than 1 MB when gzipped. That's about the size of an image, so it would be reasonable to load them all for the search, for now anyway.

- Cache instances of only limit set to 'Infinity'
@jvwong (Member Author) commented Mar 17, 2022

This is on https://test.biofactoid.org

@maxkfranz (Member)

Let's just bump up the max size of the image cache so all the images should be cached in memory. That should resolve the random image reloading in the search for now.

@jvwong (Member Author) commented Mar 17, 2022

Possible improvements:

The home component makes two closely staggered web service calls to retrieve documents: first, one to retrieve all public docs for the search (longer running), then one for the carousel to retrieve a recent slice. It seems that when the cache is empty, the latter is bottlenecked by the former. This has the effect of making the carousel load slowly for the sake of the search.

Will move to separate issue.
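
For reference, a minimal sketch of decoupling the two calls so the carousel isn't awaited behind the search; the endpoints, params, and render functions here are illustrative:

```js
const carouselReq = fetch('/api/document?limit=20').then(res => res.json());
const searchReq = fetch('/api/document?limit=Infinity').then(res => res.json());

// Render the carousel as soon as its small slice arrives; let the search
// payload fill in whenever it lands.
carouselReq.then(renderCarousel);
searchReq.then(initSearchIndex);
```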

@maxkfranz (Member)

See also: https://rethinkdb.com/docs/secondary-indexes/javascript/

You have to explicitly use the getAll command to take advantage of secondary indexes.

If we don't use the index, it will be slow.
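
A minimal sketch of a getAll query against a secondary index; the table, index, and field names here are assumptions about the schema, not taken from this PR:

```js
const r = require('rethinkdb');

const conn = await r.connect({ host: 'localhost', port: 28015 });

const cursor = await r.table('document')
  .getAll('public', { index: 'status' }) // served by the secondary index
  .orderBy(r.desc('createdDate'))        // in-memory sort of the indexed subset
  .limit(20)
  .run(conn);

const docs = await cursor.toArray();
```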

@jvwong (Member Author) commented Mar 18, 2022

Last commit managed to squeeze a bit more efficiency out of the db:

| Status | N (docs) | Content-Length (KB) | Elapsed (s) |
| --- | --- | --- | --- |
| public | 20 | 237 | 1.34 |
| public | 100 | 1,193 | 3.76 |
| public | 122 ('Infinity') | 1,335 | 4.52 |

@jvwong (Member Author) commented Mar 18, 2022

//TODO - Look at what's going on with the document pathway PNG caching (limit, speed)....

@jvwong (Member Author) commented Mar 21, 2022

> Let's just bump up the max size of the image cache so all the images should be cached in memory. That should resolve the random image reloading in the search for now.

This helped a lot. The likely reason is that, with the limited cache size, the search would effectively punt images that would otherwise show up on the homepage. So now everything is preloaded.

I think this version works decently.

@maxkfranz (Member)

> Let's just bump up the max size of the image cache so all the images should be cached in memory. That should resolve the random image reloading in the search for now.
>
> This helped a lot. The likely reason is that, with the limited cache size, the search would effectively punt images that would otherwise show up on the homepage. So now everything is preloaded.
>
> I think this version works decently.

Nice. There may have been some cache thrashing going on.

@jvwong merged commit 34de7d8 into unstable on Mar 23, 2022
@jvwong deleted the iss909_cache-recent branch on March 23, 2022 at 19:49