Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Device list doesn't change if remote server is down is flaky #1136

Open
clokep opened this issue Sep 10, 2021 · 16 comments · Fixed by #1142
Open

Device list doesn't change if remote server is down is flaky #1136

clokep opened this issue Sep 10, 2021 · 16 comments · Fixed by #1142

Comments

@clokep
Copy link
Member

clokep commented Sep 10, 2021

https://github.com/matrix-org/synapse/runs/3570688017

I think this was just re-enabled?

@clokep
Copy link
Member Author

clokep commented Sep 13, 2021

@DMRobertson DMRobertson self-assigned this Sep 14, 2021
@clokep
Copy link
Member Author

clokep commented Sep 14, 2021

@DMRobertson
Copy link
Contributor

Well, I mean to make a PR for this but screwed up my git branch management locally.

Let's see how this goes?

@H-Shay
Copy link
Contributor

H-Shay commented Sep 14, 2021

Just noting that I had a failure on this test as well today (with several re-runs)

@DMRobertson
Copy link
Contributor

I resorted to sticking debug statements into synapse. My understanding: on the first keys query,

  • Synapse under test receives keys from mock server in sytest
  • Synapse ends up calling self.device_handler.device_list_updater.user_device_resync on the master process (it seems device lists are always handled by the master)
  • This later calls update_remote_device_list_cache
  • At the end of that function we upsert into device_lists_remote_extremeties
    • user_id='@__ANON__-0:localhost:45313' stream_id=3

All normal operation as far as I can see. On the second keys query,

  • in get_user_devices_from_cache we call get_device_list_last_stream_id_for_remotes which reads from device_lists_remote_extremeties
  • The response from this leads us to build user_map={'@__ANON__-0:localhost:41307': None}
  • we act as if we've not stored keys from the previous federation request, so go and fetch them again
  • this fails the test.

Guessing there's a race between the cache invalidation and the second device keys call?? Commenting out the @cachedList call makes the test pass (with user_map={'@__ANON__-0:localhost:34477': '3'}).

Notes:

  • get_device_list_last_stream_id_for_remotes function is decorated with @cachedList
  • it looks like we try to invalidate the cache right before we do the upsert mentioned above
  • cache done on the main worker, as is the query that builds "user_map"
  • the cache invalidation will be called "on the main twisted thread after the transaction has finished".

@erikjohnston
Copy link
Member

Oh bleurgh, that's annoying.

The other way we deal with this sort of thing is by using retry_until_success { ... } to wrap the test code, which will repeatedly run it until it passes (with some appropriate exponential backoff). The downside with this approach is that failures will result in the test timing out, rather than in a more helpful manner, so it's important to add helpful logging. An example usage:

retry_until_success {
matrix_get_room_state( $user, $room_id,
type => "m.room.member",
state_key => $user->user_id,
)->then( sub {
my ( $body ) = @_;
$body->{membership} eq "join" or
die "Expected membership to be 'join'";
Future->done(1);
})
}
});

So I think we can wrap the second call to fetch keys in that and it should work?

@DMRobertson
Copy link
Contributor

Thanks @erikjohnston, that sounds like a good avenue of attack.

For completeness, I wanted to demonstrate the race in the logs. Here's a version where I've made sytest sleep 1; between key requests.

Logs with extra sytest sleep
2021-09-15 18:33:44 ✔  $ rg "/keys|DMR" /tmp/logs/server-* --color=always | sort --key 2,4 -t : -s | less
  /tmp/logs/server-0/synchrotron.log:2021-09-15 17:32:21,167 - root - 70 - WARNING - process-replication-data-100 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_173213-1:localhost:8840')
  /tmp/logs/server-0/event_persister2.log:2021-09-15 17:32:21,167 - root - 70 - WARNING - process-replication-data-100 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_173213-1:localhost:8840')
  /tmp/logs/server-0/event_persister1.log:2021-09-15 17:32:21,167 - root - 70 - WARNING - process-replication-data-132 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_173213-1:localhost:8840')
  /tmp/logs/server-0/appservice.log:2021-09-15 17:32:21,167 - root - 70 - WARNING - process-replication-data-148 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_173213-1:localhost:8840')
  /tmp/logs/server-0/frontend_proxy.log:2021-09-15 17:32:21,167 - root - 70 - WARNING - process-replication-data-164 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_173213-1:localhost:8840')
  /tmp/logs/server-0/background_worker.log:2021-09-15 17:32:21,167 - root - 70 - WARNING - process-replication-data-180 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_173213-1:localhost:8840')
  /tmp/logs/server-0/federation_sender.log:2021-09-15 17:32:21,167 - root - 70 - WARNING - process-replication-data-196 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_173213-1:localhost:8840')
  /tmp/logs/server-0/pusher.log:2021-09-15 17:32:21,167 - root - 70 - WARNING - process-replication-data-212 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_173213-1:localhost:8840')
  /tmp/logs/server-0/user_dir.log:2021-09-15 17:32:21,167 - root - 70 - WARNING - process-replication-data-228 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_173213-1:localhost:8840')
  /tmp/logs/server-0/media_repository.log:2021-09-15 17:32:21,167 - root - 70 - WARNING - process-replication-data-68 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_173213-1:localhost:8840')
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,168 - root - 70 - WARNING - process-replication-data-20 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_173213-1:localhost:8840')
  /tmp/logs/server-0/federation_reader.log:2021-09-15 17:32:21,170 - root - 70 - WARNING - process-replication-data-36 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_173213-1:localhost:8840')
  /tmp/logs/server-0/client_reader.log:2021-09-15 17:32:21,170 - root - 70 - WARNING - process-replication-data-52 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_173213-1:localhost:8840')
  /tmp/logs/server-0/event_creator.log:2021-09-15 17:32:21,171 - root - 70 - WARNING - process-replication-data-132 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_173213-1:localhost:8840')
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,778 - synapse.handlers.e2e_keys - 117 - WARNING - POST-0 - DMR: query_devices
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,778 - synapse.handlers.e2e_keys - 137 - WARNING - POST-0 - DMR: local_query={} remote_queries={'@__ANON__-0:localhost:35035': []}
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,782 - synapse.storage.databases.main.devices - 474 - WARNING - POST-0 - DMR: user_map={'@__ANON__-0:localhost:35035': None}
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,782 - synapse.storage.databases.main.devices - 481 - WARNING - POST-0 - DMR: users_needing_resync=set()
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,783 - synapse.handlers.e2e_keys - 171 - WARNING - POST-0 - DMR: user_ids_not_in_cache={'@__ANON__-0:localhost:35035'}
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,783 - synapse.handlers.e2e_keys - 227 - WARNING - POST-0 - DMR: fetch user_id='@__ANON__-0:localhost:35035', device_list=[]
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,783 - synapse.handlers.e2e_keys - 231 - WARNING - POST-0 - DMR: AAAAA
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,784 - synapse.handlers.e2e_keys - 236 - WARNING - POST-0 - DMR: BBBBB room_ids=frozenset({'!MYFsKlpvDpfGpmYNQU:localhost:8840'})
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,784 - synapse.handlers.e2e_keys - 240 - WARNING - POST-0 - DMR: CCCCC
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,784 - synapse.handlers.e2e_keys - 251 - WARNING - POST-0 - DMR: slave
  /tmp/logs/server-0/homeserver.log:2021-09-15 17:32:21,791 - synapse.handlers.device - 844 - WARNING - POST-10 - DMR: user_device_resync for user_id='@__ANON__-0:localhost:35035'
  /tmp/logs/server-0/homeserver.log:2021-09-15 17:32:21,804 - synapse.handlers.device - 895 - WARNING - POST-10 - DMR: result={'user_id': '@__ANON__-0:localhost:35035', 'devices': [{'device_display_name': 'Curiosity Rover', 'keys': {'signatures': {'@__ANON__-0:localhost:35035': {'ed2  5519:JLAFKJWSCS': 'dSO80A01XiigH3uBiDVx/EjzaoycHcjq9lfQX0uWsqxl2giMIiSPR8a4d291W1ihKJL/a+myXS367WT6NAIcBA'}}, 'device_id': 'CURIOSITY_ROVER', 'keys': {'b': 'flat', 'c': 'sharp'}, 'algorithms': ['fast', 'and broken'], 'user_id': '@__ANON__-0:localhost:35035'}, 'device_id': 'CURIOSIT  Y_ROVER'}], 'stream_id': 3}
  /tmp/logs/server-0/homeserver.log:2021-09-15 17:32:21,806 - synapse.handlers.device - 945 - WARNING - POST-10 - DMR: ignore_devices=False
  /tmp/logs/server-0/homeserver.log:2021-09-15 17:32:21,806 - synapse.storage.databases.main.devices - 1261 - WARNING - POST-10 - DMR: user_id='@__ANON__-0:localhost:35035' devices=[{'device_display_name': 'Curiosity Rover', 'keys': {'signatures': {'@__ANON__-0:localhost:35035': {'ed  25519:JLAFKJWSCS': 'dSO80A01XiigH3uBiDVx/EjzaoycHcjq9lfQX0uWsqxl2giMIiSPR8a4d291W1ihKJL/a+myXS367WT6NAIcBA'}}, 'device_id': 'CURIOSITY_ROVER', 'keys': {'b': 'flat', 'c': 'sharp'}, 'algorithms': ['fast', 'and broken'], 'user_id': '@__ANON__-0:localhost:35035'}, 'device_id': 'CURIOSI  TY_ROVER'}] stream_id=3
  /tmp/logs/server-0/homeserver.log:2021-09-15 17:32:21,807 - synapse.storage.databases.main.devices - 1306 - WARNING - POST-10 - DMR: extremities user_id='@__ANON__-0:localhost:35035' stream_id=3
  /tmp/logs/server-0/homeserver.log:2021-09-15 17:32:21,809 - synapse.handlers.device - 953 - WARNING - POST-10 - DMR: finished marking cache as valid
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,815 - synapse.access.http.8836 - 389 - INFO - POST-0 - 127.0.0.1 - 8836 - {@anon-20210915_173213-1:localhost:8840} Processed request: 0.039sec/-0.000sec (0.008sec, 0.000sec) (0.001sec/0.003sec/6) 440B 200 "POST /_matrix/clien  t/r0/keys/query?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.48" [0 dbevts]
  /tmp/logs/server-0/event_persister1.log:2021-09-15 17:32:21,922 - root - 70 - WARNING - process-replication-data-152 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
  /tmp/logs/server-0/appservice.log:2021-09-15 17:32:21,922 - root - 70 - WARNING - process-replication-data-168 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
  /tmp/logs/server-0/frontend_proxy.log:2021-09-15 17:32:21,922 - root - 70 - WARNING - process-replication-data-184 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
  /tmp/logs/server-0/background_worker.log:2021-09-15 17:32:21,922 - root - 70 - WARNING - process-replication-data-200 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
  /tmp/logs/server-0/federation_sender.log:2021-09-15 17:32:21,922 - root - 70 - WARNING - process-replication-data-216 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
  /tmp/logs/server-0/pusher.log:2021-09-15 17:32:21,922 - root - 70 - WARNING - process-replication-data-232 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
  /tmp/logs/server-0/user_dir.log:2021-09-15 17:32:21,922 - root - 70 - WARNING - process-replication-data-248 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
  /tmp/logs/server-0/synchrotron.log:2021-09-15 17:32:21,923 - root - 70 - WARNING - process-replication-data-120 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
  /tmp/logs/server-0/event_persister2.log:2021-09-15 17:32:21,923 - root - 70 - WARNING - process-replication-data-120 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
  /tmp/logs/server-0/event_creator.log:2021-09-15 17:32:21,923 - root - 70 - WARNING - process-replication-data-152 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
  /tmp/logs/server-0/client_reader.log:2021-09-15 17:32:21,923 - root - 70 - WARNING - process-replication-data-72 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
  /tmp/logs/server-0/media_repository.log:2021-09-15 17:32:21,923 - root - 70 - WARNING - process-replication-data-88 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,924 - root - 70 - WARNING - process-replication-data-40 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
  /tmp/logs/server-0/federation_reader.log:2021-09-15 17:32:21,924 - root - 70 - WARNING - process-replication-data-56 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,825 - synapse.handlers.e2e_keys - 117 - WARNING - POST-1 - DMR: query_devices
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,826 - synapse.handlers.e2e_keys - 137 - WARNING - POST-1 - DMR: local_query={} remote_queries={'@__ANON__-0:localhost:35035': []}
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,833 - synapse.storage.databases.main.devices - 474 - WARNING - POST-1 - DMR: user_map={'@__ANON__-0:localhost:35035': '3'}
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,836 - synapse.storage.databases.main.devices - 481 - WARNING - POST-1 - DMR: users_needing_resync=set()
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,840 - synapse.handlers.e2e_keys - 171 - WARNING - POST-1 - DMR: user_ids_not_in_cache=set()
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,842 - synapse.access.http.8836 - 389 - INFO - POST-1 - 127.0.0.1 - 8836 - {@anon-20210915_173213-1:localhost:8840} Processed request: 0.017sec/-0.000sec (0.011sec, 0.000sec) (0.003sec/0.004sec/4) 493B 200 "POST /_matrix/clien  t/r0/keys/query?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.48" [0 dbevts]

In particular, note that the stream_writer is told over replication about the device lists having changed:

  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:21,924 - root - 70 - WARNING - process-replication-data-40 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@__ANON__-0:localhost:35035')

So we get a correct cache hit:

  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,826 - synapse.handlers.e2e_keys - 137 - WARNING - POST-1 - DMR: local_query={} remote_queries={'@__ANON__-0:localhost:35035': []}
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,833 - synapse.storage.databases.main.devices - 474 - WARNING - POST-1 - DMR: user_map={'@__ANON__-0:localhost:35035': '3'}
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,836 - synapse.storage.databases.main.devices - 481 - WARNING - POST-1 - DMR: users_needing_resync=set()
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:32:22,840 - synapse.handlers.e2e_keys - 171 - WARNING - POST-1 - DMR: user_ids_not_in_cache=set()

Here's the version without the sleep.

Without sleep
  /tmp/logs/server-0/frontend_proxy.log:2021-09-15 17:45:20,277 - root - 70 - WARNING - process-replication-data-161 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_174513-1:localhost:8840')
  /tmp/logs/server-0/synchrotron.log:2021-09-15 17:45:20,277 - root - 70 - WARNING - process-replication-data-177 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_174513-1:localhost:8840')
  /tmp/logs/server-0/media_repository.log:2021-09-15 17:45:20,277 - root - 70 - WARNING - process-replication-data-193 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_174513-1:localhost:8840')
  /tmp/logs/server-0/event_creator.log:2021-09-15 17:45:20,277 - root - 70 - WARNING - process-replication-data-209 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_174513-1:localhost:8840')
  /tmp/logs/server-0/federation_sender.log:2021-09-15 17:45:20,277 - root - 70 - WARNING - process-replication-data-225 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_174513-1:localhost:8840')
  /tmp/logs/server-0/appservice.log:2021-09-15 17:45:20,278 - root - 70 - WARNING - process-replication-data-113 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_174513-1:localhost:8840')
  /tmp/logs/server-0/event_persister2.log:2021-09-15 17:45:20,278 - root - 70 - WARNING - process-replication-data-113 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_174513-1:localhost:8840')
  /tmp/logs/server-0/background_worker.log:2021-09-15 17:45:20,278 - root - 70 - WARNING - process-replication-data-129 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_174513-1:localhost:8840')
  /tmp/logs/server-0/pusher.log:2021-09-15 17:45:20,278 - root - 70 - WARNING - process-replication-data-145 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_174513-1:localhost:8840')
  /tmp/logs/server-0/event_persister1.log:2021-09-15 17:45:20,278 - root - 70 - WARNING - process-replication-data-17 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_174513-1:localhost:8840')
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:20,278 - root - 70 - WARNING - process-replication-data-65 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_174513-1:localhost:8840')
  /tmp/logs/server-0/federation_reader.log:2021-09-15 17:45:20,278 - root - 70 - WARNING - process-replication-data-81 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_174513-1:localhost:8840')
  /tmp/logs/server-0/user_dir.log:2021-09-15 17:45:20,282 - root - 70 - WARNING - process-replication-data-33 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_174513-1:localhost:8840')
  /tmp/logs/server-0/client_reader.log:2021-09-15 17:45:20,283 - root - 70 - WARNING - process-replication-data-49 - DMR: row=DeviceListsStream.DeviceListsStreamRow(entity='@anon-20210915_174513-1:localhost:8840')
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,071 - synapse.handlers.e2e_keys - 117 - WARNING - POST-0 - DMR: query_devices
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,071 - synapse.handlers.e2e_keys - 137 - WARNING - POST-0 - DMR: local_query={} remote_queries={'@__ANON__-0:localhost:33603': []}
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,074 - synapse.storage.databases.main.devices - 474 - WARNING - POST-0 - DMR: user_map={'@__ANON__-0:localhost:33603': None}
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,075 - synapse.handlers.e2e_keys - 171 - WARNING - POST-0 - DMR: user_ids_not_in_cache={'@__ANON__-0:localhost:33603'}
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,075 - synapse.handlers.e2e_keys - 227 - WARNING - POST-0 - DMR: fetch user_id='@__ANON__-0:localhost:33603', device_list=[]
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,075 - synapse.handlers.e2e_keys - 231 - WARNING - POST-0 - DMR: AAAAA
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,075 - synapse.storage.databases.main.devices - 481 - WARNING - POST-0 - DMR: users_needing_resync=set()
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,077 - synapse.handlers.e2e_keys - 236 - WARNING - POST-0 - DMR: BBBBB room_ids=frozenset({'!PSQPVgzGgcyCSphXvA:localhost:8840'})
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,077 - synapse.handlers.e2e_keys - 240 - WARNING - POST-0 - DMR: CCCCC
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,077 - synapse.handlers.e2e_keys - 251 - WARNING - POST-0 - DMR: slave
  /tmp/logs/server-0/homeserver.log:2021-09-15 17:45:21,083 - synapse.handlers.device - 844 - WARNING - POST-10 - DMR: user_device_resync for user_id='@__ANON__-0:localhost:33603'
  /tmp/logs/server-0/homeserver.log:2021-09-15 17:45:21,096 - synapse.handlers.device - 895 - WARNING - POST-10 - DMR: result={'user_id': '@__ANON__-0:localhost:33603', 'stream_id': 3, 'devices': [{'device_display_name': 'Curiosity Rover', 'device_id': 'CURIOSITY_ROVER', 'keys': {'de  vice_id': 'CURIOSITY_ROVER', 'keys': {'b': 'flat', 'c': 'sharp'}, 'signatures': {'@__ANON__-0:localhost:33603': {'ed25519:JLAFKJWSCS': 'dSO80A01XiigH3uBiDVx/EjzaoycHcjq9lfQX0uWsqxl2giMIiSPR8a4d291W1ihKJL/a+myXS367WT6NAIcBA'}}, 'user_id': '@__ANON__-0:localhost:33603', 'algorithms':   ['fast', 'and broken']}}]}
  /tmp/logs/server-0/homeserver.log:2021-09-15 17:45:21,098 - synapse.handlers.device - 945 - WARNING - POST-10 - DMR: ignore_devices=False
  /tmp/logs/server-0/homeserver.log:2021-09-15 17:45:21,098 - synapse.storage.databases.main.devices - 1261 - WARNING - POST-10 - DMR: user_id='@__ANON__-0:localhost:33603' devices=[{'device_display_name': 'Curiosity Rover', 'device_id': 'CURIOSITY_ROVER', 'keys': {'device_id': 'CURI  OSITY_ROVER', 'keys': {'b': 'flat', 'c': 'sharp'}, 'signatures': {'@__ANON__-0:localhost:33603': {'ed25519:JLAFKJWSCS': 'dSO80A01XiigH3uBiDVx/EjzaoycHcjq9lfQX0uWsqxl2giMIiSPR8a4d291W1ihKJL/a+myXS367WT6NAIcBA'}}, 'user_id': '@__ANON__-0:localhost:33603', 'algorithms': ['fast', 'and   broken']}}] stream_id=3
  /tmp/logs/server-0/homeserver.log:2021-09-15 17:45:21,100 - synapse.storage.databases.main.devices - 1306 - WARNING - POST-10 - DMR: extremities user_id='@__ANON__-0:localhost:33603' stream_id=3
  /tmp/logs/server-0/homeserver.log:2021-09-15 17:45:21,101 - synapse.handlers.device - 953 - WARNING - POST-10 - DMR: finished marking cache as valid
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,108 - synapse.access.http.8836 - 389 - INFO - POST-0 - 127.0.0.1 - 8836 - {@anon-20210915_174513-1:localhost:8840} Processed request: 0.042sec/-0.000sec (0.008sec, 0.000sec) (0.004sec/0.004sec/6) 440B 200 "POST /_matrix/clien  t/r0/keys/query?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.48" [0 dbevts]
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,112 - synapse.handlers.e2e_keys - 117 - WARNING - POST-1 - DMR: query_devices
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,112 - synapse.handlers.e2e_keys - 137 - WARNING - POST-1 - DMR: local_query={} remote_queries={'@__ANON__-0:localhost:33603': []}
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,113 - synapse.storage.databases.main.devices - 474 - WARNING - POST-1 - DMR: user_map={'@__ANON__-0:localhost:33603': None}
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,114 - synapse.handlers.e2e_keys - 171 - WARNING - POST-1 - DMR: user_ids_not_in_cache={'@__ANON__-0:localhost:33603'}
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,114 - synapse.handlers.e2e_keys - 227 - WARNING - POST-1 - DMR: fetch user_id='@__ANON__-0:localhost:33603', device_list=[]
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,114 - synapse.handlers.e2e_keys - 231 - WARNING - POST-1 - DMR: AAAAA
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,114 - synapse.handlers.e2e_keys - 236 - WARNING - POST-1 - DMR: BBBBB room_ids=frozenset({'!PSQPVgzGgcyCSphXvA:localhost:8840'})
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,114 - synapse.handlers.e2e_keys - 240 - WARNING - POST-1 - DMR: CCCCC
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,114 - synapse.storage.databases.main.devices - 481 - WARNING - POST-1 - DMR: users_needing_resync=set()
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,115 - synapse.handlers.e2e_keys - 251 - WARNING - POST-1 - DMR: slave
  /tmp/logs/server-0/homeserver.log:2021-09-15 17:45:21,116 - synapse.handlers.device - 844 - WARNING - POST-12 - DMR: user_device_resync for user_id='@__ANON__-0:localhost:33603'
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,132 - synapse.http.matrixfederationclient - 670 - WARNING - POST-1 - {POST-O-1} [localhost:33603] Request failed: POST matrix://localhost:33603/_matrix/federation/v1/user/keys/query: HttpResponseException('400: Bad Request')
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,133 - synapse.access.http.8836 - 389 - INFO - POST-1 - 127.0.0.1 - 8836 - {@anon-20210915_174513-1:localhost:8840} Processed request: 0.021sec/-0.000sec (0.004sec, 0.000sec) (0.001sec/0.001sec/3) 156B 200 "POST /_matrix/clien  t/r0/keys/query?access_token=<redacted> HTTP/1.1" "Perl + Net::Async::HTTP/0.48" [0 dbevts]

The stream_writer never gets the replication message (I'm assuming we kill the server before it has chance to arrive.) So we get an old (incorrect) cache hit:

 /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,113 - synapse.storage.databases.main.devices - 474 - WARNING - POST-1 - DMR: user_map={'@__ANON__-0:localhost:33603': None}
  /tmp/logs/server-0/stream_writer.log:2021-09-15 17:45:21,114 - synapse.handlers.e2e_keys - 171 - WARNING - POST-1 - DMR: user_ids_not_in_cache={'@__ANON__-0:localhost:33603'}

@DMRobertson
Copy link
Contributor

Unfortunately I don't think retry_until_success works either. We end up with a different kind of race!

  • POST-0: initial request, device list written on master. publish notification on replication
  • POST-1: yet to receive replication notice, so we make a federation request for device keys. This returns 400. We mark the device as requiring resync.
  • Replication notification arrives.
  • POST-2: we recognise the device in question as being in the DB's cache. But it's marked as requiring resync, so we make a federation request, which returns 400. We get stuck in a loop.

@erikjohnston erikjohnston changed the title Device list doesn't change if remote server is down is flaky Device list doesn't change if remote server is down is flaky Sep 16, 2021
DMRobertson pushed a commit that referenced this issue Sep 16, 2021
…wn` (#1142)

* Attempt to deflake `Device list doesn't change if remote server is down`

Hopefully closes #1136

Co-authored-by: Erik Johnston <erik@matrix.org>
@DMRobertson
Copy link
Contributor

Also https://github.com/matrix-org/synapse/runs/3630907641, but I think that's running against a release branch in sytest that doesn't have the above PR (merged to sytest develop)

@DMRobertson
Copy link
Contributor

DMRobertson commented Sep 21, 2021

@DMRobertson
Copy link
Contributor

@DMRobertson DMRobertson removed their assignment Mar 30, 2022
@DMRobertson DMRobertson reopened this Mar 30, 2022
@DMRobertson
Copy link
Contributor

@DMRobertson
Copy link
Contributor

@DMRobertson
Copy link
Contributor

@reivilibre
Copy link
Contributor

@richvdh
Copy link
Member

richvdh commented Jun 9, 2022

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants