Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Strip whitespace from worker types in Dockerfile-workers #14165

Merged
merged 6 commits into from
Oct 14, 2022

Conversation

realtyem
Copy link
Contributor

@realtyem realtyem commented Oct 13, 2022

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

Resolves #14112

synapse_federation_sender1 | 2022-10-12 23:12:36,557 - synapse.replication.tcp.client - 481 - ERROR - _save_and_send_ack-0 - Error updating federation stream position
synapse_federation_sender1 | Traceback (most recent call last):
synapse_federation_sender1 |   File "/usr/local/lib/python3.9/site-packages/synapse/replication/tcp/client.py", line 471, in _save_and_send_ack
synapse_federation_sender1 |     await self.store.update_federation_out_pos(
synapse_federation_sender1 |   File "/usr/local/lib/python3.9/site-packages/synapse/storage/databases/main/stream.py", line 1104, in update_federation_out_pos
synapse_federation_sender1 |     await self.db_pool.simple_update_one(
synapse_federation_sender1 |   File "/usr/local/lib/python3.9/site-packages/synapse/storage/database.py", line 2002, in simple_update_one
synapse_federation_sender1 |     await self.runInteraction(
synapse_federation_sender1 |   File "/usr/local/lib/python3.9/site-packages/synapse/storage/database.py", line 881, in runInteraction
synapse_federation_sender1 |     return await delay_cancellation(_runInteraction())
synapse_federation_sender1 |   File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
synapse_federation_sender1 |     result = current_context.run(
synapse_federation_sender1 |   File "/usr/local/lib/python3.9/site-packages/twisted/python/failure.py", line 514, in throwExceptionIntoGenerator
synapse_federation_sender1 |     return g.throw(self.type, self.value, self.tb)
synapse_federation_sender1 |   File "/usr/local/lib/python3.9/site-packages/synapse/storage/database.py", line 848, in _runInteraction
synapse_federation_sender1 |     result = await self.runWithConnection(
synapse_federation_sender1 |   File "/usr/local/lib/python3.9/site-packages/synapse/storage/database.py", line 976, in runWithConnection
synapse_federation_sender1 |     return await make_deferred_yieldable(
synapse_federation_sender1 |   File "/usr/local/lib/python3.9/site-packages/twisted/python/threadpool.py", line 244, in inContext
synapse_federation_sender1 |     result = inContext.theWork()  # type: ignore[attr-defined]
synapse_federation_sender1 |   File "/usr/local/lib/python3.9/site-packages/twisted/python/threadpool.py", line 260, in <lambda>
synapse_federation_sender1 |     inContext.theWork = lambda: context.call(  # type: ignore[attr-defined]
synapse_federation_sender1 |   File "/usr/local/lib/python3.9/site-packages/twisted/python/context.py", line 117, in callWithContext
synapse_federation_sender1 |     return self.currentContext().callWithContext(ctx, func, *args, **kw)
synapse_federation_sender1 |   File "/usr/local/lib/python3.9/site-packages/twisted/python/context.py", line 82, in callWithContext
synapse_federation_sender1 |     return func(*args, **kw)

This exception manifests when using two federation_senders in SYNAPSE_WORKER_TYPES, both with and when not using Complement. Trying other values with multiple workers like event_persister,event_persister would work where event_persister, event_persister would not. At line 448 of configure_workers_and_start.py a list is checked for values with 'count()'. It appears that the stripping of whitespace at line 423 interferes with the counting of values in the list. It won't correctly count values with extra whitespace in front. For example, the start_for_complement.sh file at line 48 exports a list of workers that ends up looking like

'event_persister,       event_persister,       background_worker,       frontend_proxy,       event_creator,       user_dir,       media_repository,       federation_inbound,       federation_reader,       federation_sender,       synchrotron,       client_reader,       appservice,       pusher'

Since the count in the list is looking for values with the whitespace stripped off, it fails and never actually executes function add_sharding_to_shared_config() which assembles the configuration for sharding inside of instance_map. This configuration is then supposed to be written to /conf/workers/shared.yaml inside the docker image. The rest of the file is written, but the values to set up sharding do not. Unfortunately, this means that any uses of the Dockerfile-workers that utilizes sharding workers, like for instance federation_sender or pusher, didn't work. I added some logging to my repo to track this down. Env worker types is the direct string as it's passed into the docker image. Notice the extra leading whitespace pulled in from a direct copy of the values in start_for_complement.sh:
Before:

Env worker types : |'event_persister,       event_persister,       background_worker,       frontend_proxy,       event_creator,       user_dir,       media_repository,       federation_inbound,       federation_reader,       federation_sender,       synchrotron,       client_reader,       appservice,       pusher'|
Processing : |'event_persister'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'1'|
Processing : |'       event_persister'|
new_worker_count:        |'2'|
worker_type_counter:     |'2'|
Worker Type total count: |'1'|
Processing : |'       background_worker'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'0'|
Processing : |'       frontend_proxy'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'0'|
Processing : |'       event_creator'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'0'|
Processing : |'       user_dir'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'0'|
Processing : |'       media_repository'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'0'|
Processing : |'       federation_inbound'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'0'|
Processing : |'       federation_reader'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'0'|
Processing : |'       federation_sender'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'0'|
Processing : |'       synchrotron'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'0'|
Processing : |'       client_reader'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'0'|
Processing : |'       appservice'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'0'|
Processing : |'       pusher'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'0'|

After:

Env worker types : |'event_persister,       event_persister,       background_worker,       frontend_proxy,       event_creator,       user_dir,       media_repository,       federation_inbound,       federation_reader,       federation_sender,       synchrotron,       client_reader,       appservice,       pusher'|
Processing : |'event_persister'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'2'|
Running add_sharding_to_shared_config withevent_persister
Shard Parsing: |'event_persister'|
Processing : |'event_persister'|
new_worker_count:        |'2'|
worker_type_counter:     |'2'|
Worker Type total count: |'2'|
Running add_sharding_to_shared_config withevent_persister
Shard Parsing: |'event_persister'|
Processing : |'background_worker'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'1'|
Processing : |'frontend_proxy'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'1'|
Processing : |'event_creator'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'1'|
Processing : |'user_dir'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'1'|
Processing : |'media_repository'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'1'|
Processing : |'federation_inbound'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'1'|
Processing : |'federation_reader'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'1'|
Processing : |'federation_sender'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'1'|
Processing : |'synchrotron'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'1'|
Processing : |'client_reader'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'1'|
Processing : |'appservice'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'1'|
Processing : |'pusher'|
new_worker_count:        |'1'|
worker_type_counter:     |'1'|
Worker Type total count: |'1'|

Worker Type total count is the value that's important. If it doesn't get above 1, then add_sharding_to_shared_config() doesn't execute. After fixing this on my own homeserver, the event_persister workers exploded into activity. I thought it was just a really efficient process.

Signed-off-by: Jason Little realtyem@gmail.com

Use a 'for' loop to iterate through a string, split it at a comma ',' and strip leading and trailing whitespace
with strip().
@realtyem realtyem marked this pull request as ready for review October 13, 2022 18:50
@realtyem realtyem requested a review from a team as a code owner October 13, 2022 18:50
Copy link
Contributor

@DMRobertson DMRobertson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your thoroughness investigating this. Other than wording tweaks this LGTM.

docker/configure_workers_and_start.py Outdated Show resolved Hide resolved
changelog.d/14165.docker Outdated Show resolved Hide resolved
@DMRobertson DMRobertson changed the title Sanitize worker types List for use in Docker image for workers. Strip whitespace from worker types in Dockerfile-workers Oct 14, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enabling Metrics while having a federation_sender sharded causes an error
2 participants