Stop getting missing prev_events after we already know their signature is invalid (#13816)

While #13635 stops us from doing the slow thing after we've already done it once, this PR stops us from doing one of the slow things in the first place.

Related to
 - #13622
    - #13635
 - #13676

Part of #13356

Follow-up to #13815 which tracks event signature failures.

With this PR, we avoid the costly call to `_get_state_ids_after_missing_prev_event` because the signature failure will already have been recorded as a failed pull attempt, and we now filter events based on that backoff before calling `_get_state_ids_after_missing_prev_event`.
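A minimal sketch of the short-circuit this adds (the helper, its name, and its wiring are hypothetical; the real logic lives in `_compute_event_context_with_maybe_missing_prevs` in the diff below, and the failed-attempt recording on signature failure comes from #13815):

```python
from typing import Collection

from synapse.api.errors import FederationPullAttemptBackoffError


async def maybe_fetch_missing_prev_events(
    store, room_id: str, missing_prevs: Collection[str]
) -> None:
    """Hypothetical helper; `store` is assumed to expose the methods added in
    this PR (and #13815, which records failed pull attempts on bad signatures)."""
    # Skip any prev_events that we recently failed to pull, e.g. because their
    # signatures were invalid during a previous backfill.
    prevs_to_ignore = await store.get_event_ids_to_not_pull_from_backoff(
        room_id, missing_prevs
    )
    if prevs_to_ignore:
        # Bail out before the costly _get_state_ids_after_missing_prev_event
        # path (a /state_ids request plus get_pdu calls that would fail again).
        raise FederationPullAttemptBackoffError(
            event_ids=prevs_to_ignore, message=None
        )
    # ...otherwise fall through to the existing
    # _get_state_ids_after_missing_prev_event path.
```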

For example, this will save us 156s out of the 185s total that this `matrix.org` `/messages` request took. If you want to see the full Jaeger trace, you can drag and drop this `trace.json` into your own Jaeger instance: https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761

To explain this exact scenario around `/messages` -> backfill: we call `/backfill` and first check the signatures of the 100 events returned. We see bad signatures for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` and `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` (both member events). Then we process the 98 remaining events that have valid signatures, but one of them references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`. So we have to do the whole `_get_state_ids_after_missing_prev_event` rigmarole, which pulls in those same events, which fail again because the signatures are still invalid.

- `backfill`
  - `outgoing-federation-request` `/backfill`
  - `_check_sigs_and_hash_and_fetch`
    - `_check_sigs_and_hash_and_fetch_one` for each event received over backfill
      - ❗ `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
      - ❗ `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
  - `_process_pulled_events`
    - `_process_pulled_event` for each validated event
      - ❗ Event `$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0` references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event` which is missing so we try to get it
        - `_get_state_ids_after_missing_prev_event`
          - `outgoing-federation-request` `/state_ids`
          - ❗ `get_pdu` for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` which fails the signature check again
          - ❗ `get_pdu` for `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` which fails the signature check
MadLittleMods authored Oct 15, 2022
1 parent 616dcc1 commit 40bb37e
Showing 7 changed files with 386 additions and 2 deletions.
1 change: 1 addition & 0 deletions changelog.d/13816.feature
@@ -0,0 +1 @@
Stop fetching missing `prev_events` after we already know their signature is invalid.
21 changes: 21 additions & 0 deletions synapse/api/errors.py
@@ -640,6 +640,27 @@ def get_dict(self) -> "JsonDict":
}


class FederationPullAttemptBackoffError(RuntimeError):
"""
Raised to indicate that we are deliberately not attempting to pull the given
event over federation because we've already done so recently and are backing off.
Attributes:
event_ids: The event_ids which we are refusing to pull
message: A custom error message that gives more context
"""

def __init__(self, event_ids: List[str], message: Optional[str]):
self.event_ids = event_ids

if message:
error_message = message
else:
error_message = f"Not attempting to pull event_ids={self.event_ids} because we already tried to pull them recently (backing off)."

super().__init__(error_message)


class HttpResponseException(CodeMessageException):
"""
Represents an HTTP-level failure of an outbound request
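As a quick usage sketch (assuming the class lands as shown above), the default message is built from the event IDs when no custom message is supplied:

```python
from synapse.api.errors import FederationPullAttemptBackoffError

try:
    raise FederationPullAttemptBackoffError(
        event_ids=["$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M"], message=None
    )
except FederationPullAttemptBackoffError as exc:
    # Prints the default message from __init__:
    # "Not attempting to pull event_ids=[...] because we already tried to pull
    # them recently (backing off)."
    print(exc)
```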
16 changes: 16 additions & 0 deletions synapse/handlers/federation.py
@@ -45,6 +45,7 @@
Codes,
FederationDeniedError,
FederationError,
FederationPullAttemptBackoffError,
HttpResponseException,
LimitExceededError,
NotFoundError,
@@ -1720,7 +1721,22 @@ async def _sync_partial_state_room(
destination, event
)
break
except FederationPullAttemptBackoffError as exc:
# Log a warning about why we failed to process the event (the error message
# for `FederationPullAttemptBackoffError` is pretty good)
logger.warning("_sync_partial_state_room: %s", exc)
# We do not record a failed pull attempt when we back off fetching a missing
# `prev_event` because not being able to fetch the `prev_events` just means
# we won't be able to de-outlier the pulled event. But we can still use an
# `outlier` in the state/auth chain for another event. So we shouldn't stop
# a downstream event from trying to pull it.
#
# This avoids a cascade of backoff for all events in the DAG downstream from
# one event backing off upstream.
except FederationError as e:
# TODO: We should `record_event_failed_pull_attempt` here,
# see https://github.com/matrix-org/synapse/issues/13700

if attempt == len(destinations) - 1:
# We have tried every remote server for this event. Give up.
# TODO(faster_joins) giving up isn't the right thing to do
31 changes: 31 additions & 0 deletions synapse/handlers/federation_event.py
@@ -44,6 +44,7 @@
AuthError,
Codes,
FederationError,
FederationPullAttemptBackoffError,
HttpResponseException,
RequestSendFailed,
SynapseError,
@@ -567,6 +568,9 @@ async def update_state_for_partial_state_event(
event: partial-state event to be de-partial-stated
Raises:
FederationPullAttemptBackoffError if we are deliberately not attempting
to pull the given event over federation because we've already done so
recently and are backing off.
FederationError if we fail to request state from the remote server.
"""
logger.info("Updating state for %s", event.event_id)
@@ -901,6 +905,18 @@ async def _process_pulled_event(
context,
backfilled=backfilled,
)
except FederationPullAttemptBackoffError as exc:
# Log a warning about why we failed to process the event (the error message
# for `FederationPullAttemptBackoffError` is pretty good)
logger.warning("_process_pulled_event: %s", exc)
# We do not record a failed pull attempt when we back off fetching a missing
# `prev_event` because not being able to fetch the `prev_events` just means
# we won't be able to de-outlier the pulled event. But we can still use an
# `outlier` in the state/auth chain for another event. So we shouldn't stop
# a downstream event from trying to pull it.
#
# This avoids a cascade of backoff for all events in the DAG downstream from
# one event backing off upstream.
except FederationError as e:
await self._store.record_event_failed_pull_attempt(
event.room_id, event_id, str(e)
@@ -947,6 +963,9 @@ async def _compute_event_context_with_maybe_missing_prevs(
The event context.
Raises:
FederationPullAttemptBackoffError if we are deliberately not attempting
to pull the given event over federation because we've already done so
recently and are backing off.
FederationError if we fail to get the state from the remote server after any
missing `prev_event`s.
"""
@@ -957,6 +976,18 @@
seen = await self._store.have_events_in_timeline(prevs)
missing_prevs = prevs - seen

# If we've already recently attempted to pull this missing event, don't
# try it again so soon. Since we have to fetch all of the prev_events, we can
# bail early here if we find any to ignore.
prevs_to_ignore = await self._store.get_event_ids_to_not_pull_from_backoff(
room_id, missing_prevs
)
if len(prevs_to_ignore) > 0:
raise FederationPullAttemptBackoffError(
event_ids=prevs_to_ignore,
message=f"While computing context for event={event_id}, not attempting to pull missing prev_event={prevs_to_ignore[0]} because we already tried to pull recently (backing off).",
)

if not missing_prevs:
return await self._state_handler.compute_event_context(event)

54 changes: 54 additions & 0 deletions synapse/storage/databases/main/event_federation.py
@@ -1501,6 +1501,12 @@ async def record_event_failed_pull_attempt(
event_id: The event that failed to be fetched or processed
cause: The error message or reason that we failed to pull the event
"""
logger.debug(
"record_event_failed_pull_attempt room_id=%s, event_id=%s, cause=%s",
room_id,
event_id,
cause,
)
await self.db_pool.runInteraction(
"record_event_failed_pull_attempt",
self._record_event_failed_pull_attempt_upsert_txn,
@@ -1530,6 +1536,54 @@ def _record_event_failed_pull_attempt_upsert_txn(

txn.execute(sql, (room_id, event_id, 1, self._clock.time_msec(), cause))

@trace
async def get_event_ids_to_not_pull_from_backoff(
self,
room_id: str,
event_ids: Collection[str],
) -> List[str]:
"""
Filter down the events to ones that we've recently failed to pull. Uses
exponential backoff.
Args:
room_id: The room that the events belong to
event_ids: A list of events to filter down
Returns:
List of event_ids that should not be attempted to be pulled
"""
event_failed_pull_attempts = await self.db_pool.simple_select_many_batch(
table="event_failed_pull_attempts",
column="event_id",
iterable=event_ids,
keyvalues={},
retcols=(
"event_id",
"last_attempt_ts",
"num_attempts",
),
desc="get_event_ids_to_not_pull_from_backoff",
)

current_time = self._clock.time_msec()
return [
event_failed_pull_attempt["event_id"]
for event_failed_pull_attempt in event_failed_pull_attempts
# Exponential back-off (up to the upper bound) so we don't try to
# pull the same event over and over. ex. 2hr, 4hr, 8hr, 16hr, etc.
if current_time
< event_failed_pull_attempt["last_attempt_ts"]
+ (
2
** min(
event_failed_pull_attempt["num_attempts"],
BACKFILL_EVENT_EXPONENTIAL_BACKOFF_MAXIMUM_DOUBLING_STEPS,
)
)
* BACKFILL_EVENT_EXPONENTIAL_BACKOFF_STEP_MILLISECONDS
]

async def get_missing_events(
self,
room_id: str,
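To make the backoff arithmetic concrete, here is a small self-contained sketch of the window the list comprehension above computes. The one-hour step and the doubling cap of 8 are assumptions for illustration; the real values come from `BACKFILL_EVENT_EXPONENTIAL_BACKOFF_STEP_MILLISECONDS` and `BACKFILL_EVENT_EXPONENTIAL_BACKOFF_MAXIMUM_DOUBLING_STEPS`, defined elsewhere in this file:

```python
# Illustrative constants (assumed values; see the real definitions in
# synapse/storage/databases/main/event_federation.py).
STEP_MS = 60 * 60 * 1000  # 1 hour
MAX_DOUBLING_STEPS = 8


def backoff_window_ms(num_attempts: int) -> int:
    """How long after last_attempt_ts we refuse to pull the event again."""
    return (2 ** min(num_attempts, MAX_DOUBLING_STEPS)) * STEP_MS


def should_skip(num_attempts: int, last_attempt_ts: int, now_ms: int) -> bool:
    """Mirrors the filter condition in get_event_ids_to_not_pull_from_backoff."""
    return now_ms < last_attempt_ts + backoff_window_ms(num_attempts)


# 1 failed attempt -> skip for 2h, 2 -> 4h, 3 -> 8h, ... capped at 2**8 hours.
assert backoff_window_ms(1) == 2 * 60 * 60 * 1000
assert backoff_window_ms(3) == 8 * 60 * 60 * 1000
assert backoff_window_ms(20) == (2**8) * 60 * 60 * 1000
```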
