
Using an external cache server #2123

Open
slipeer opened this issue Apr 12, 2017 · 8 comments
Labels
T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

@slipeer
Contributor

slipeer commented Apr 12, 2017

I found that synapse uses a cache implemented with its own code (synapse/util/caches).
This creates some difficulties.

I believe that using an external cache server like memcached or Redis could solve these problems.

The advantages of an external cache:

  • it can be accessed by all of the application's processes, possibly running on several nodes.
  • memory storage is quite efficient, and handled in a separate process.
  • it becomes possible to configure workers to use the same cache and the same database as synapse.
  • it becomes possible to put two synapse servers behind a load balancer, working with one cache and a shared database (a sketch of this pattern follows below).
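
For illustration, a minimal cache-aside sketch of that last point, assuming the redis-py client; the key layout and the `load_user_from_db` helper are made up, and this is not a description of synapse's current code:

```python
# A minimal cache-aside sketch, NOT synapse's actual design: every worker or
# node talks to the same Redis, so cache contents are shared between them.
# `load_user_from_db` and the key layout are made up for illustration.
import json

import redis

r = redis.Redis(host="cache.example.com", port=6379)

def load_user_from_db(user_id):
    # stand-in for a real database query
    return {"user_id": user_id, "displayname": "Alice"}

def get_user(user_id):
    key = "user/%s" % user_id
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)        # shared cache hit
    user = load_user_from_db(user_id)    # miss: fall back to the shared database
    r.set(key, json.dumps(user), ex=60)  # short TTL rather than invalidation
    return user
```
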
@richvdh
Member

richvdh commented Sep 19, 2018

Some thoughts on external caches which arose as a result of discussions around #3798.

First of all, it's certainly not true that you could just put two synapse masters behind a load balancer and point them at the same database and cache. That said, there are certainly some advantages to an external cache, including:

  • the ability to share caches between workers (currently we are limited in the number of synchrotrons we can deploy by the RAM available on our boxes)
  • faster restarts (currently it takes quite a while for synapse to warm its caches after a restart)
  • not having to invent our own cache-invalidation protocol.

However, there are also some potential problems with an external cache:

  • The in-application cache implementation helps reduce CPU usage, as we can cache structured data without the need for serialisation and deserialisation. We might be able to mitigate this by (say) caching python pickles rather than json (python-memcached has support for pickling and unpickling objects; see the sketch after this list), but there is still an overhead in doing so.

  • There are real concerns over the latency of an external cache. Even if an external cache can manage a lookup latency of (say) 1ms, that's a lot of latency when you're talking about tens of thousands of lookups a second.

  • We're reluctant to require users to configure a memcached or redis for a simple deployment. (There is debate over how bad it would be to just fall back to the database in this case.)
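
As a minimal sketch of the pickling idea in the first bullet (assuming a local memcached and the python-memcached client; the cached event structure here is made up):

```python
# python-memcached pickles non-string values transparently, so structured
# data can round-trip without hand-written serialisation -- though pickling
# itself still costs CPU on every set/get.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

event = {"event_id": "$abc123", "depth": 42, "prev_events": ["$xyz789"]}
mc.set("event/$abc123", event)      # pickled on the way in
cached = mc.get("event/$abc123")    # unpickled on the way out
assert cached == event
```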

A likely scenario is therefore that we would end up with an in-memory cache as well as an external cache, which leaves us with twice as many cache-invalidation problems. On the other hand, our biggest and most latency-sensitive caches (eg, the event cache) are never actually invalidated (they are simple LRU caches).

A plausible compromise might be to drop invalidation support from the in-memory caches, and for things that might care about invalidation, instead either go straight to the external cache / db, or use a short TTL.
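
For concreteness, a minimal sketch of an in-memory cache along those lines (no invalidation at all; the 5-second TTL is an arbitrary placeholder, not a proposed value):

```python
# Sketch: a read-through cache that never needs invalidation messages;
# stale entries simply age out after a short TTL.
import time

class TTLCache:
    def __init__(self, ttl_seconds=5.0):
        self._ttl = ttl_seconds
        self._entries = {}  # key -> (expiry_timestamp, value)

    def get_or_fetch(self, key, fetch):
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]           # fresh enough: serve from memory
        value = fetch(key)            # otherwise go to the external cache / db
        self._entries[key] = (now + self._ttl, value)
        return value
```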

However, all of that is a non-trivial amount of work.

@cjdelisle
Contributor

Hey, I want to bump this issue because, by my calculation, time spent accessing memory is (or at least was) the majority of the time spent in the synapse event loop. Here's my logic:

Every time a program has to hit main memory, the CPU essentially stalls for what is said to be around 300 nanoseconds. If there is another hardware thread, the processor can schedule it (hyperthreading), but this is too short a timespan for the kernel to swap a thread out, so it appears as though the thread is using 100% CPU during the stall. However, we're in a single-threaded event loop, so there is no other thread to swap in: the process is simply stalled.

If you're accessing a dictionary, you're looking at roughly 300 * log2(n) nanoseconds of stall time per access. So let's say you had a dictionary of 1 million items and you accessed it 200 times: that's 300 * 20 * 200 ns, or 1.2 milliseconds stolen from your main loop, putting an absolute upper bound of about 800 requests per second.
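
Spelling that arithmetic out (every constant here is an assumption from the paragraph above, not a measurement):

```python
# Back-of-envelope: stall time per request spent on dictionary lookups.
# All constants are assumptions from the text above, not measurements.
import math

STALL_NS = 300               # assumed cost of one main-memory access
items = 1_000_000            # assumed dictionary size
accesses = 200               # assumed lookups per request

depth = math.log2(items)                       # ~20 memory hops per lookup
stall_ms = STALL_NS * depth * accesses / 1e6   # ~1.2 ms stalled per request
max_rps = 1000 / stall_ms                      # ~830 requests/second ceiling
print(f"{stall_ms:.2f} ms stalled -> at most ~{max_rps:.0f} req/s")
```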

Now let's say you were to shove this off to memcached. The first advantage memcached has is threads: it can use many of them to access memory, so when one stalls it doesn't block the main event loop. Furthermore, when one memcached thread has 100 requests to process, it can issue the CPU's prefetch instruction for all of them before it actually tries to access any of them. Prefetching causes the CPU's memory controller to copy the relevant memory location into the processor cache, so that when the access does happen it's a ~20ns L3 hit rather than a ~300ns main-memory access. The memcached and redis people have no doubt spent many sleepless nights squeezing every possible cycle of performance out of the processor, because that's their raison d'être.

Now I want to also bust some myths about an external cache:

  1. The parsing and serializing is slow: OK, I can't really bust this one, but I will say that in all my experience I've never found that computation was the bottleneck; every single time I had performance issues it was memory lookups. CPUs are just astonishingly fast.
  2. Network latency to the cache will make user experience worse: Let's say (as a fabricated example) you need to look up the user's data, then from that the list of channels the user is in, and then the most recent message in each of those channels. Let's also imagine that your memcached instance is in another country and you have a full 20ms of network latency to it. The first two requests depend on one another, and the per-channel requests can all be parallelized into a third round trip, so that's roughly 60ms spent getting stuff from the cache (see the sketch after this list). That keeps query time in the 200ms happy zone, and the CPU on-die time you spend on the request is microseconds rather than milliseconds, so your maximum requests per second rises by an order of magnitude, which prevents requests queueing -- the thing that causes response times to run into seconds.
  3. It will be hard for admins to set up synapse if they need to install memcached: This is true, and if it is important to support stand-alone instances, you can wrap the memcached functions and provide a secondary pure-python implementation, which won't be high-performance but will work. However, I would start by asking who is using the sqlite version and how much developer effort it adds by having to be maintained/tested/supported; it might turn out that there is more value in simply offering admins a docker image with postgres and memcached than in trying to support many different configurations.
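
A minimal asyncio sketch of the critical-path argument in point 2; `cache_get` is a stand-in that simulates a 20ms round trip, not a real memcached client:

```python
# Sketch: three sequential round trips in total, because the per-channel
# lookups all go out in parallel. `cache_get` is a stand-in, not a client.
import asyncio

LATENCY = 0.020  # assumed 20ms round trip to a far-away cache

async def cache_get(key):
    await asyncio.sleep(LATENCY)    # simulate the network round trip
    return "value-of-%s" % key

async def handle_request(user_id):
    user = await cache_get("user/%s" % user_id)    # round trip 1
    await cache_get("channels/%s" % user_id)       # round trip 2 (depends on 1)
    channels = ["#a", "#b", "#c"]  # pretend round trip 2 returned these
    # round trip 3: fan out in parallel -> ~20ms total, not 20ms per channel
    latest = await asyncio.gather(
        *(cache_get("latest/%s" % c) for c in channels)
    )
    return user, latest

asyncio.run(handle_request("@alice:example.com"))  # ~60ms wall clock
```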

So now you might be thinking "yeah yeah, maybe there's a few percent lost here, but it can't be serious", so I'll show you some data. This goes back to an old version, I believe 0.28, which is when I was admining synapse.

[Screenshot: CPU flame graph of synapse, taken 2019-11-25]

This is a [CPU Flame Graph](http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#OtherLanguages), a compilation of stack traces indicating where synapse spends most of its time. The mouse is over the `intern_dict` function, which is interning strings and thus needs to walk a very large internal dictionary -- easily millions or tens of millions of entries -- and it is doing so hundreds of times because it interns every result from the database. According to the flame graph, **synapse spends 57% of its time inside the cursor_to_dict function**. This may no longer be true, but I highly recommend the use of flame graphs, because my experience has been that whenever there is a noticeable performance issue, it's always a very small piece of the code which is easily improved once one knows about it.

As a final comment, moving state to memcached (or redis; I don't actually have an opinion on which) is a very good thing to do, because you will eventually reach a point where synapse becomes entirely stateless, and at that point there's really nothing preventing you from pointing multiple synapse instances at the same backend. There may be a few places where you'll want to take out a lock, but that is also doable. And once multiple synapse instances can be pointed at the same backend, you can reduce development effort by discontinuing the code for the worker model.

@richvdh
Member

richvdh commented Nov 25, 2019

@cjdelisle thanks for the detailed thoughts on this.

Broadly I agree with you - I think there could be real benefits from using an external cache.

A couple of things though:

  • firstly, if your metrics are based on synapse 0.28 with python 2, they are way out of date. Python 3 shows very different CPU usage patterns.
  • I think your example of a request that requires two cache lookups is naive. Our whole problem here is that we are doing tens of thousands of cache lookups for a relatively small number of requests (say 100 requests per second). 100 cache lookups per request wouldn't be an unreasonable average, and some requests (large /syncs, for instance) could be a couple of orders of magnitude larger.

@cjdelisle
Contributor

On point 1 you're absolutely right; I'm not admining synapse at the moment, so I don't have anything to collect new metrics from. If you are collecting them on matrix.org then that's excellent; if you're not, then please consider it -- with perf it is safe to do in production, and the value of the data cannot be overstated.

On point 2, the number of lookups is not what's important; what's important is the length of the critical path. As per my example, doing a lookup to get the list of rooms for a user and then a lookup for each room they're in might be hundreds of lookups, but it has a critical path of 2 (see the sketch below). Of course, you would know better than me what the lengths of typical critical paths are...
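
For what it's worth, a minimal sketch of that shape of access using python-memcached's get_multi (the key names are made up):

```python
# Hundreds of lookups, critical path of 2: one get for the room list, then
# one batched get_multi for all the rooms. Key names are made up.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

room_ids = mc.get("rooms_for_user/@alice:example.com") or []  # round trip 1
rooms = mc.get_multi(["room/%s" % rid for rid in room_ids])   # round trip 2
# get_multi returns a dict of the keys that were found; misses are absent.
```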

@richvdh
Member

richvdh commented Nov 25, 2019

we've collected plenty of flame graphs on matrix.org using pyflame, though I don't have one to hand at the moment, I'm afraid.

I'd be interested if you have a mechanism for producing them with perf?

@cjdelisle
Contributor

Ahh, I think I misremembered: in fact it was pyflame that I was using. Anyway, if you're on top of the profiling game, then I guess I'm just blowing smoke about now-ancient performance issues, in which case I'm sorry for the bother.

@clokep
Member

clokep commented Dec 2, 2021

In #9198 we implemented the ability to share some information between workers, using Redis as a cache. I'm not sure whether this issue is "done" or not, however.
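
(For illustration only -- this is not the actual #9198 code, and the key names are made up -- the general shape of sharing data between worker processes via Redis looks something like this:)

```python
# Illustrative only, not the actual #9198 implementation: because the data
# lives in Redis, any worker process (on any host) can read what another
# worker wrote.
import redis

r = redis.Redis(host="localhost", port=6379)

# worker A stores a computed value with a short TTL
r.set("shared-cache/example-key", b"some-computed-value", ex=30)

# worker B, a separate process, reads the same entry
value = r.get("shared-cache/example-key")
```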

@richvdh
Member

richvdh commented Dec 2, 2021

I don't think it's done done. There is a lot more stuff we could usefully put in an external cache.
