This repository was archived by the owner on Apr 26, 2024. It is now read-only.

Commit 37f6823

Add instance name to RDATA/POSITION commands (#7364)
This is primarily to allow us to send those commands from workers, but for now it simply lets us ignore echoed RDATA/POSITION commands that we sent ourselves (we receive echoes of our own commands when using Redis). Currently we log a WARNING on the master process every time we receive an echoed RDATA.
1 parent: 3eab76a
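The sketch below makes the echo problem concrete: when commands travel over Redis pub/sub, the sender is also a subscriber, so it receives its own RDATA/POSITION back. This is a simplified, standalone illustration of the check that ReplicationCommandHandler gains in this commit (see the synapse/replication/tcp/handler.py hunks below); the EchoFilter class is purely illustrative and not part of Synapse.

    class EchoFilter:
        """Simplified stand-in for the echo check added to ReplicationCommandHandler."""

        def __init__(self, instance_name: str):
            # In Synapse this value comes from hs.get_instance_name(), i.e. the
            # configured worker_name, or "master" for the main process.
            self._instance_name = instance_name

        def should_handle(self, cmd_instance_name: str) -> bool:
            # Commands published over Redis are echoed to every subscriber,
            # including the process that sent them; drop our own.
            return cmd_instance_name != self._instance_name

    echo_filter = EchoFilter("master")
    assert echo_filter.should_handle("master") is False  # our own echo: ignore
    assert echo_filter.should_handle("worker") is True   # another instance's data: process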

File tree: 10 files changed (+95, -50 lines)

changelog.d/7364.misc (1 addition, 0 deletions)

@@ -0,0 +1 @@
+Add an `instance_name` to `RDATA` and `POSITION` replication commands.

docs/tcp_replication.md (24 additions, 17 deletions)

@@ -15,15 +15,17 @@ example flow would be (where '>' indicates master to worker and

     > SERVER example.com
     < REPLICATE
-    > POSITION events 53
-    > RDATA events 54 ["$foo1:bar.com", ...]
-    > RDATA events 55 ["$foo4:bar.com", ...]
+    > POSITION events master 53
+    > RDATA events master 54 ["$foo1:bar.com", ...]
+    > RDATA events master 55 ["$foo4:bar.com", ...]

 The example shows the server accepting a new connection and sending its identity
 with the `SERVER` command, followed by the client server to respond with the
 position of all streams. The server then periodically sends `RDATA` commands
-which have the format `RDATA <stream_name> <token> <row>`, where the format of
-`<row>` is defined by the individual streams.
+which have the format `RDATA <stream_name> <instance_name> <token> <row>`, where
+the format of `<row>` is defined by the individual streams. The
+`<instance_name>` is the name of the Synapse process that generated the data
+(usually "master").

 Error reporting happens by either the client or server sending an ERROR
 command, and usually the connection will be closed.

@@ -52,7 +54,7 @@ The basic structure of the protocol is line based, where the initial
 word of each line specifies the command. The rest of the line is parsed
 based on the command. For example, the RDATA command is defined as:

-    RDATA <stream_name> <token> <row_json>
+    RDATA <stream_name> <instance_name> <token> <row_json>

 (Note that <row_json> may contains spaces, but cannot contain
 newlines.)

@@ -136,11 +138,11 @@ the wire:
     < NAME synapse.app.appservice
     < PING 1490197665618
     < REPLICATE
-    > POSITION events 1
-    > POSITION backfill 1
-    > POSITION caches 1
-    > RDATA caches 2 ["get_user_by_id",["@01register-user:localhost:8823"],1490197670513]
-    > RDATA events 14 ["$149019767112vOHxz:localhost:8823",
+    > POSITION events master 1
+    > POSITION backfill master 1
+    > POSITION caches master 1
+    > RDATA caches master 2 ["get_user_by_id",["@01register-user:localhost:8823"],1490197670513]
+    > RDATA events master 14 ["$149019767112vOHxz:localhost:8823",
     "!AFDCvgApUmpdfVjIXm:localhost:8823","m.room.guest_access","",null]
     < PING 1490197675618
     > ERROR server stopping

@@ -151,10 +153,10 @@ position without needing to send data with the `RDATA` command.

 An example of a batched set of `RDATA` is:

-    > RDATA caches batch ["get_user_by_id",["@test:localhost:8823"],1490197670513]
-    > RDATA caches batch ["get_user_by_id",["@test2:localhost:8823"],1490197670513]
-    > RDATA caches batch ["get_user_by_id",["@test3:localhost:8823"],1490197670513]
-    > RDATA caches 54 ["get_user_by_id",["@test4:localhost:8823"],1490197670513]
+    > RDATA caches master batch ["get_user_by_id",["@test:localhost:8823"],1490197670513]
+    > RDATA caches master batch ["get_user_by_id",["@test2:localhost:8823"],1490197670513]
+    > RDATA caches master batch ["get_user_by_id",["@test3:localhost:8823"],1490197670513]
+    > RDATA caches master 54 ["get_user_by_id",["@test4:localhost:8823"],1490197670513]

 In this case the client shouldn't advance their caches token until it
 sees the the last `RDATA`.

@@ -178,6 +180,11 @@ client (C):
    updates, and if so then fetch them out of band. Sent in response to a
    REPLICATE command (but can happen at any time).

+   The POSITION command includes the source of the stream. Currently all streams
+   are written by a single process (usually "master"). If fetching missing
+   updates via HTTP API, rather than via the DB, then processes should make the
+   request to the appropriate process.
+
 #### ERROR (S, C)

 There was an error

@@ -234,12 +241,12 @@ Each individual cache invalidation results in a row being sent down
 replication, which includes the cache name (the name of the function)
 and they key to invalidate. For example:

-    > RDATA caches 550953771 ["get_user_by_id", ["@bob:example.com"], 1550574873251]
+    > RDATA caches master 550953771 ["get_user_by_id", ["@bob:example.com"], 1550574873251]

 Alternatively, an entire cache can be invalidated by sending down a `null`
 instead of the key. For example:

-    > RDATA caches 550953772 ["get_user_by_id", null, 1550574873252]
+    > RDATA caches master 550953772 ["get_user_by_id", null, 1550574873252]

 However, there are times when a number of caches need to be invalidated
 at the same time with the same key. To reduce traffic we batch those
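As a quick, hedged sketch of the field layout just described (this is not Synapse's parser, which lives in synapse/replication/tcp/commands.py and is changed below; the example row is the cache-invalidation line from this document):

    import json

    # One RDATA line in the new format:
    #     RDATA <stream_name> <instance_name> <token> <row_json>
    line = 'RDATA caches master 550953771 ["get_user_by_id", ["@bob:example.com"], 1550574873251]'

    command, rest = line.split(" ", 1)
    assert command == "RDATA"

    # <row_json> may contain spaces, so only split off the first three fields.
    stream_name, instance_name, token, row_json = rest.split(" ", 3)
    row = json.loads(row_json)

    print(stream_name, instance_name, token)  # caches master 550953771
    print(row)  # ['get_user_by_id', ['@bob:example.com'], 1550574873251]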

synapse/app/_base.py (2 additions, 2 deletions)

@@ -270,7 +270,7 @@ def handle_sighup(*args, **kwargs):

         # Start the tracer
         synapse.logging.opentracing.init_tracer(  # type: ignore[attr-defined] # noqa
-            hs.config
+            hs
         )

         # It is now safe to start your Synapse.

@@ -316,7 +316,7 @@ def setup_sentry(hs):
         scope.set_tag("matrix_server_name", hs.config.server_name)

         app = hs.config.worker_app if hs.config.worker_app else "synapse.app.homeserver"
-        name = hs.config.worker_name if hs.config.worker_name else "master"
+        name = hs.get_instance_name()
         scope.set_tag("worker_app", app)
         scope.set_tag("worker_name", name)

synapse/logging/opentracing.py (10 additions, 13 deletions)

@@ -171,14 +171,17 @@ def set_fates(clotho, lachesis, atropos, father="Zues", mother="Themis"):
 import re
 import types
 from functools import wraps
-from typing import Dict
+from typing import TYPE_CHECKING, Dict

 from canonicaljson import json

 from twisted.internet import defer

 from synapse.config import ConfigError

+if TYPE_CHECKING:
+    from synapse.server import HomeServer
+
 # Helper class

@@ -297,14 +300,11 @@ def _noop_context_manager(*args, **kwargs):
 # Setup


-def init_tracer(config):
+def init_tracer(hs: "HomeServer"):
     """Set the whitelists and initialise the JaegerClient tracer
-
-    Args:
-        config (HomeserverConfig): The config used by the homeserver
     """
     global opentracing
-    if not config.opentracer_enabled:
+    if not hs.config.opentracer_enabled:
         # We don't have a tracer
         opentracing = None
         return

@@ -315,18 +315,15 @@ def init_tracer(config):
             "installed."
         )

-    # Include the worker name
-    name = config.worker_name if config.worker_name else "master"
-
     # Pull out the jaeger config if it was given. Otherwise set it to something sensible.
     # See https://github.com/jaegertracing/jaeger-client-python/blob/master/jaeger_client/config.py

-    set_homeserver_whitelist(config.opentracer_whitelist)
+    set_homeserver_whitelist(hs.config.opentracer_whitelist)

     JaegerConfig(
-        config=config.jaeger_config,
-        service_name="{} {}".format(config.server_name, name),
-        scope_manager=LogContextScopeManager(config),
+        config=hs.config.jaeger_config,
+        service_name="{} {}".format(hs.config.server_name, hs.get_instance_name()),
+        scope_manager=LogContextScopeManager(hs.config),
     ).initialize_tracer()

synapse/replication/tcp/commands.py (26 additions, 11 deletions)

@@ -95,7 +95,7 @@ class RdataCommand(Command):

     Format::

-        RDATA <stream_name> <token> <row_json>
+        RDATA <stream_name> <instance_name> <token> <row_json>

     The `<token>` may either be a numeric stream id OR "batch". The latter case
     is used to support sending multiple updates with the same stream ID. This

@@ -105,33 +105,40 @@ class RdataCommand(Command):
     The client should batch all incoming RDATA with a token of "batch" (per
     stream_name) until it sees an RDATA with a numeric stream ID.

+    The `<instance_name>` is the source of the new data (usually "master").
+
     `<token>` of "batch" maps to the instance variable `token` being None.

     An example of a batched series of RDATA::

-        RDATA presence batch ["@foo:example.com", "online", ...]
-        RDATA presence batch ["@bar:example.com", "online", ...]
-        RDATA presence 59 ["@baz:example.com", "online", ...]
+        RDATA presence master batch ["@foo:example.com", "online", ...]
+        RDATA presence master batch ["@bar:example.com", "online", ...]
+        RDATA presence master 59 ["@baz:example.com", "online", ...]
     """

     NAME = "RDATA"

-    def __init__(self, stream_name, token, row):
+    def __init__(self, stream_name, instance_name, token, row):
         self.stream_name = stream_name
+        self.instance_name = instance_name
         self.token = token
         self.row = row

     @classmethod
     def from_line(cls, line):
-        stream_name, token, row_json = line.split(" ", 2)
+        stream_name, instance_name, token, row_json = line.split(" ", 3)
         return cls(
-            stream_name, None if token == "batch" else int(token), json.loads(row_json)
+            stream_name,
+            instance_name,
+            None if token == "batch" else int(token),
+            json.loads(row_json),
         )

     def to_line(self):
         return " ".join(
             (
                 self.stream_name,
+                self.instance_name,
                 str(self.token) if self.token is not None else "batch",
                 _json_encoder.encode(self.row),
             )

@@ -145,23 +152,31 @@ class PositionCommand(Command):
     """Sent by the server to tell the client the stream postition without
     needing to send an RDATA.

+    Format::
+
+        POSITION <stream_name> <instance_name> <token>
+
     On receipt of a POSITION command clients should check if they have missed
     any updates, and if so then fetch them out of band.
+
+    The `<instance_name>` is the process that sent the command and is the source
+    of the stream.
     """

     NAME = "POSITION"

-    def __init__(self, stream_name, token):
+    def __init__(self, stream_name, instance_name, token):
         self.stream_name = stream_name
+        self.instance_name = instance_name
         self.token = token

     @classmethod
     def from_line(cls, line):
-        stream_name, token = line.split(" ", 1)
-        return cls(stream_name, int(token))
+        stream_name, instance_name, token = line.split(" ", 2)
+        return cls(stream_name, instance_name, int(token))

     def to_line(self):
-        return " ".join((self.stream_name, str(self.token)))
+        return " ".join((self.stream_name, self.instance_name, str(self.token)))


 class ErrorCommand(_SimpleCommand):
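A short round-trip sketch against the patched classes, assuming they are importable from synapse.replication.tcp.commands as shown above (the test changes further down exercise the same parsing path; note that from_line/to_line handle the line without the leading command name):

    from synapse.replication.tcp.commands import PositionCommand, RdataCommand

    # Parse an RDATA body in the new format (the "RDATA " prefix is stripped
    # by parse_command_from_line before from_line is called).
    cmd = RdataCommand.from_line(
        'events master 6287863 ["ev", ["$eventid", "!roomid", "type", null, null, null]]'
    )
    print(cmd.stream_name, cmd.instance_name, cmd.token)  # events master 6287863

    # to_line() re-emits the instance_name, so the command survives a round trip.
    assert RdataCommand.from_line(cmd.to_line()).instance_name == "master"

    # POSITION carries the instance name too.
    pos = PositionCommand("events", "master", 53)
    print(pos.to_line())  # events master 53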

synapse/replication/tcp/handler.py (14 additions, 3 deletions)

@@ -79,6 +79,7 @@ def __init__(self, hs):
         self._notifier = hs.get_notifier()
         self._clock = hs.get_clock()
         self._instance_id = hs.get_instance_id()
+        self._instance_name = hs.get_instance_name()

         # Set of streams that we've caught up with.
         self._streams_connected = set()  # type: Set[str]

@@ -156,7 +157,7 @@ def start_replication(self, hs):
                 hs.config.redis.redis_host, hs.config.redis.redis_port, self._factory,
             )
         else:
-            client_name = hs.config.worker_name
+            client_name = hs.get_instance_name()
             self._factory = DirectTcpReplicationClientFactory(hs, client_name, self)
             host = hs.config.worker_replication_host
             port = hs.config.worker_replication_port

@@ -170,7 +171,9 @@ async def on_REPLICATE(self, conn: AbstractConnection, cmd: ReplicateCommand):

         for stream_name, stream in self._streams.items():
             current_token = stream.current_token()
-            self.send_command(PositionCommand(stream_name, current_token))
+            self.send_command(
+                PositionCommand(stream_name, self._instance_name, current_token)
+            )

     async def on_USER_SYNC(self, conn: AbstractConnection, cmd: UserSyncCommand):
         user_sync_counter.inc()

@@ -235,6 +238,10 @@ async def on_USER_IP(self, conn: AbstractConnection, cmd: UserIpCommand):
             await self._server_notices_sender.on_user_ip(cmd.user_id)

     async def on_RDATA(self, conn: AbstractConnection, cmd: RdataCommand):
+        if cmd.instance_name == self._instance_name:
+            # Ignore RDATA that are just our own echoes
+            return
+
         stream_name = cmd.stream_name
         inbound_rdata_count.labels(stream_name).inc()

@@ -286,6 +293,10 @@ async def on_rdata(self, stream_name: str, token: int, rows: list):
         await self._replication_data_handler.on_rdata(stream_name, token, rows)

     async def on_POSITION(self, conn: AbstractConnection, cmd: PositionCommand):
+        if cmd.instance_name == self._instance_name:
+            # Ignore POSITION that are just our own echoes
+            return
+
         stream = self._streams.get(cmd.stream_name)
         if not stream:
             logger.error("Got POSITION for unknown stream: %s", cmd.stream_name)

@@ -485,7 +496,7 @@ def stream_update(self, stream_name: str, token: str, data: Any):

         We need to check if the client is interested in the stream or not
         """
-        self.send_command(RdataCommand(stream_name, token, data))
+        self.send_command(RdataCommand(stream_name, self._instance_name, token, data))


 UpdateToken = TypeVar("UpdateToken")

synapse/server.py (11 additions, 2 deletions)

@@ -234,7 +234,8 @@ def __init__(self, hostname: str, config: HomeServerConfig, reactor=None, **kwar
         self._listening_services = []
         self.start_time = None

-        self.instance_id = random_string(5)
+        self._instance_id = random_string(5)
+        self._instance_name = config.worker_name or "master"

         self.clock = Clock(reactor)
         self.distributor = Distributor()

@@ -254,7 +255,15 @@ def get_instance_id(self):
         This is used to distinguish running instances in worker-based
         deployments.
         """
-        return self.instance_id
+        return self._instance_id
+
+    def get_instance_name(self) -> str:
+        """A unique name for this synapse process.
+
+        Used to identify the process over replication and in config. Does not
+        change over restarts.
+        """
+        return self._instance_name

     def setup(self):
         logger.info("Setting up.")
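For reference, a tiny hedged sketch of the fallback introduced above (derive_instance_name is an illustrative helper, not Synapse code; the "worker" value mirrors the test setup further down):

    def derive_instance_name(worker_name):
        # Mirrors the new HomeServer.__init__ line: use the configured
        # worker_name, falling back to "master" for the main process.
        return worker_name or "master"

    assert derive_instance_name(None) == "master"      # main synapse process
    assert derive_instance_name("worker") == "worker"  # a configured worker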

synapse/server.pyi (2 additions, 0 deletions)

@@ -122,6 +122,8 @@ class HomeServer(object):
         pass
     def get_instance_id(self) -> str:
         pass
+    def get_instance_name(self) -> str:
+        pass
     def get_event_builder_factory(self) -> EventBuilderFactory:
         pass
     def get_storage(self) -> synapse.storage.Storage:

tests/replication/slave/storage/_base.py (1 addition, 0 deletions)

@@ -57,6 +57,7 @@ def prepare(self, reactor, clock, hs):
         # We now do some gut wrenching so that we have a client that is based
         # off of the slave store rather than the main store.
         self.replication_handler = ReplicationCommandHandler(self.hs)
+        self.replication_handler._instance_name = "worker"
         self.replication_handler._replication_data_handler = ReplicationDataHandler(
             self.slaved_store
         )

tests/replication/tcp/test_commands.py (4 additions, 2 deletions)

@@ -28,15 +28,17 @@ def test_parse_one_word_command(self):
         self.assertIsInstance(cmd, ReplicateCommand)

     def test_parse_rdata(self):
-        line = 'RDATA events 6287863 ["ev", ["$eventid", "!roomid", "type", null, null, null]]'
+        line = 'RDATA events master 6287863 ["ev", ["$eventid", "!roomid", "type", null, null, null]]'
         cmd = parse_command_from_line(line)
         self.assertIsInstance(cmd, RdataCommand)
         self.assertEqual(cmd.stream_name, "events")
+        self.assertEqual(cmd.instance_name, "master")
         self.assertEqual(cmd.token, 6287863)

     def test_parse_rdata_batch(self):
-        line = 'RDATA presence batch ["@foo:example.com", "online"]'
+        line = 'RDATA presence master batch ["@foo:example.com", "online"]'
         cmd = parse_command_from_line(line)
         self.assertIsInstance(cmd, RdataCommand)
         self.assertEqual(cmd.stream_name, "presence")
+        self.assertEqual(cmd.instance_name, "master")
         self.assertIsNone(cmd.token)
