Description
The ValidateJoinRequest contains the cluster state uncompressed.
This causes problems once the cluster state reaches a certain size. For one, it requires a massive amount of memory even after #82608. In addition, reading the full state directly on the transport thread (unlike the publication handler, which deserializes on GENERIC) is too slow.
For a 40k indices cluster with beats mappings and an admittedly large number of data streams this is what happens:
[2022-01-27T11:35:37,960][WARN ][o.e.t.InboundHandler ] [elasticsearch-2] handling request [InboundMessage{Header{554386564}{8.1.0}{1239565}{true}{false}{false}{false}{internal:cluster/coordination/join/validate}}] took [7208ms] which is above the warn threshold of [5000ms]
We receive and deserialise a 500M+ message on the transport thread.
This becomes troublesome due to the heap required just to buffer the message on a fresh master node that might otherwise be capable of handling this kind of cluster state (it's smaller on heap due to setting and mapping deduplication).
The slowness on the transport thread can mostly be blamed on the time it takes to read index settings.
This relates to #80493 and setting deduplication in general. Ideally we should find a way of deduplicating the settings better to make the message smaller. Until then, a reasonable solution might be to simply compress the state in the message and read it as plain bytes, then deserialise on GENERIC like we do for the publication handler.
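A rough sketch of the proposed shape, using only the JDK rather than the actual Elasticsearch transport classes. Every name here (CompressedStateSketch, handleValidateJoin, the "generic" executor) is hypothetical, DEFLATE from java.util.zip stands in for whatever compression the real message would use, and a plain string stands in for the serialized cluster state. The point is only that the receiving thread takes opaque compressed bytes and hands decompression and deserialization to another pool:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class CompressedStateSketch {

    // Sender side: compress the already-serialized state before putting it in the message.
    static byte[] compress(byte[] serializedState) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(out)) {
            deflater.write(serializedState);
        }
        return out.toByteArray();
    }

    // Inverse of compress(); meant to run off the transport thread.
    static byte[] decompress(byte[] compressed) throws IOException {
        try (InflaterInputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            return in.readAllBytes();
        }
    }

    // Receiver side: the calling ("transport") thread only holds the opaque bytes
    // and forks; decompression and deserialization happen on the supplied pool.
    static CompletableFuture<String> handleValidateJoin(byte[] compressed, ExecutorService generic) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                // Stand-in for reading a real cluster state from the stream.
                return new String(decompress(compressed), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }, generic);
    }

    public static void main(String[] args) throws Exception {
        // Highly repetitive input, loosely mimicking duplicated index settings.
        byte[] state = "index.refresh_interval=1s;".repeat(10_000).getBytes(StandardCharsets.UTF_8);
        byte[] wire = compress(state);
        System.out.println("uncompressed=" + state.length + " compressed=" + wire.length);

        ExecutorService generic = Executors.newFixedThreadPool(2);
        try {
            String restored = handleValidateJoin(wire, generic).get();
            System.out.println("round trip ok: " + (restored.length() == state.length));
        } finally {
            generic.shutdown();
        }
    }
}
```

Because duplicated settings compress well, this would shrink the buffered message, but it does not remove the cost of serializing the full state on the sending node.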
An additional issue is that the master (the sending node) has to serialize this message in full, which potentially puts a problematic amount of strain on it.