
Parameter for a threshold value to remove peers that are delivering stale transactions or generally misbehave #28345

Closed
drozdse1 opened this issue Oct 15, 2023 · 8 comments

Comments

@drozdse1

Rationale

After updating my client, I often see the following entries in my Geth log file:

WARN [10-15|13:31:24.568] Peer delivering stale transactions       peer=5cfee384dfdda5892f9d64f651ead855f8e0fb11e80b39a13274d47d2fb719a9 rejected=41
WARN [10-15|13:31:24.789] Peer delivering stale transactions       peer=5cfee384dfdda5892f9d64f651ead855f8e0fb11e80b39a13274d47d2fb719a9 rejected=65

...

WARN [10-15|13:31:32.312] Peer delivering stale transactions       peer=536a69c3ffea7e8201c15b170996b3f305a830cbd2f65dcfb135de0cb96d75fc rejected=128
WARN [10-15|13:31:33.006] Peer delivering stale transactions       peer=a4851bd6344c4b770a0a720e6b7d21ced3681d11d07ab91b3396766fa2fe6713 rejected=128

For now I remove the offending peer manually through the console:

admin.removePeer()
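
If it helps, here is a minimal Go sketch that performs the same removal over the admin RPC namespace instead of the interactive console. Assumptions: the admin API is exposed on the node's IPC endpoint at /var/lib/goethereum/geth.ipc, and the peer ID is the one taken from the log above.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ethereum/go-ethereum/rpc"
)

// peerInfo mirrors only the admin_peers fields needed here.
type peerInfo struct {
	ID    string `json:"id"`
	Enode string `json:"enode"`
}

func main() {
	// Peer ID copied from the "Peer delivering stale transactions" log line.
	staleID := "5cfee384dfdda5892f9d64f651ead855f8e0fb11e80b39a13274d47d2fb719a9"

	// Assumed IPC path; adjust to your --datadir.
	client, err := rpc.DialContext(context.Background(), "/var/lib/goethereum/geth.ipc")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	var peers []peerInfo
	if err := client.Call(&peers, "admin_peers"); err != nil {
		log.Fatal(err)
	}
	for _, p := range peers {
		if p.ID == staleID {
			var removed bool
			if err := client.Call(&removed, "admin_removePeer", p.Enode); err != nil {
				log.Fatal(err)
			}
			fmt.Printf("removed %s: %v\n", staleID, removed)
		}
	}
}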

Implementation

It would be very helpful to have a parameter specifying the maximum number of misbehaviours per peer; once this is exceeded, the corresponding peer would be removed automatically or blocked for a specific period.

The misbehaviour could, of course, also cover other typical problems that produce warnings in the log file or that generally endanger the stability of the Geth application.
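
To make the request more concrete, below is a purely illustrative Go sketch of the kind of bookkeeping such a parameter could control. None of these names (misbehaviourTracker, maxMisbehaviours, banPeriod) exist in Geth; they are hypothetical and only show the intended behaviour: count misbehaviours per peer, and once a threshold is exceeded, drop the peer and ignore it for a while.

package main

import (
	"fmt"
	"sync"
	"time"
)

// misbehaviourTracker counts misbehaviours per peer and reports when a peer
// should be dropped and temporarily banned. Hypothetical, not Geth code.
type misbehaviourTracker struct {
	mu               sync.Mutex
	counts           map[string]int
	bannedUntil      map[string]time.Time
	maxMisbehaviours int           // hypothetical threshold parameter
	banPeriod        time.Duration // hypothetical ban duration parameter
}

func newMisbehaviourTracker(max int, ban time.Duration) *misbehaviourTracker {
	return &misbehaviourTracker{
		counts:           make(map[string]int),
		bannedUntil:      make(map[string]time.Time),
		maxMisbehaviours: max,
		banPeriod:        ban,
	}
}

// record notes one misbehaviour (e.g. a "Peer delivering stale transactions"
// warning) for the given peer ID and returns true when the peer has crossed
// the threshold and should be removed and banned for banPeriod.
func (t *misbehaviourTracker) record(peerID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()

	t.counts[peerID]++
	if t.counts[peerID] >= t.maxMisbehaviours {
		t.counts[peerID] = 0
		t.bannedUntil[peerID] = time.Now().Add(t.banPeriod)
		return true
	}
	return false
}

// banned reports whether a peer is still inside its ban window.
func (t *misbehaviourTracker) banned(peerID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return time.Now().Before(t.bannedUntil[peerID])
}

func main() {
	t := newMisbehaviourTracker(20, 10*time.Minute)
	peer := "5cfee384dfdda5892f9d64f651ead855f8e0fb11e80b39a13274d47d2fb719a9"
	for i := 0; i < 20; i++ {
		if t.record(peer) {
			fmt.Println("threshold exceeded, remove and ban peer", peer)
		}
	}
	fmt.Println("still banned:", t.banned(peer))
}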

@drozdse1
Author

drozdse1 commented Oct 16, 2023

I have created a bash script that does something similar to what I requested.

#!/bin/bash

container_name="eth-docker-execution-1"
desired_event_pattern="Peer delivering stale transactions"
threshold=20  # Number of consecutive occurrences

declare -A peer_counts

# Function to perform the action
perform_action() {
  local peer_id="$1"
  local count=${peer_counts[$peer_id]}
  echo "Performing action for peer ID: $peer_id since '$desired_event_pattern' occurred $count times consecutively."

  # Add your action here, such as notification commands or script execution.
  # Execute the docker exec command to remove the peer
  docker exec -t "$container_name" geth attach --datadir /var/lib/goethereum --exec "admin.removePeer(admin.peers.find((item) => item.id == '$peer_id').enode)"

  peer_counts["$peer_id"]=0  # Reset the counter for this peer
}

# Monitor the Docker container logs
docker logs "$container_name" -f --since 1m 2>&1 | while read -r line
do
  if [[ "$line" =~ $desired_event_pattern ]]; then
    # Extract the peer ID from the log message
    peer_id=$(echo "$line" | grep -oP 'peer=\K[^ ]+')
    echo "Peer ID: $peer_id"

    if [ -z "${peer_counts[$peer_id]}" ]; then
      peer_counts["$peer_id"]=1
    else
      ((peer_counts["$peer_id"]++))
    fi

    if [ "${peer_counts[$peer_id]}" -ge "$threshold" ]; then
      perform_action "$peer_id"
      peer_counts["$peer_id"]=0  # Reset the counter for this peer
    fi
  else
    # When the event pattern is not found, decrease the counter for all observed peer IDs.
    # The regular INFO logs for newly created blocks serve as a natural clock to decrease the counter again.
    for observed_peer_id in "${!peer_counts[@]}"; do
      if [ -n "${peer_counts[$observed_peer_id]}" ]; then
        ((peer_counts["$observed_peer_id"]--))
        if [ "${peer_counts[$observed_peer_id]}" -le 0 ]; then
          unset peer_counts["$observed_peer_id"]  # Remove peer ID from observation if count reaches or goes below 0
        fi
      fi
    done
  fi
done

@lightclient
Member

I think an issue with disconnecting from a peer that is delivering stale transactions is that the transactions may not yet be stale to them if they are behind the chain for any reason, and disconnecting could exacerbate the problem. Sleeping a bit before engaging the peer further seems appropriate.
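
A rough sketch of that back-off idea, purely illustrative and not Geth code; the function name and the 128s cap are made up:

package main

import (
	"fmt"
	"time"
)

// backoff returns how long to wait before asking a peer for more transactions
// after it has delivered stale ones `strikes` times in a row, rather than
// disconnecting it. The delay doubles per strike and tops out at 128s.
func backoff(strikes int) time.Duration {
	if strikes < 0 {
		strikes = 0
	}
	if strikes > 7 {
		strikes = 7
	}
	return time.Second << uint(strikes)
}

func main() {
	for _, s := range []int{0, 3, 10} {
		fmt.Printf("strikes=%d -> wait %v\n", s, backoff(s))
	}
}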

@SjonHortensius
Contributor

I have similar issues. I see ~15K/hour of the supposed Peer delivering stale transactions warnings (which I'm not able to fix) and ~25K/hour of Announced transaction size mismatch messages.

I realize this is triggered by external factors, but it seems excessive for something I cannot control.

@lightclient
Member

@SjonHortensius thanks for the report. The log level is a slightly different issue from modifying the actual handling of the peer. I think downgrading the error could make sense.

@karalabe
Member

@SjonHortensius @lightclient The announced tx issue has been "fixed" in the latest release.

@drozdse1
Author

Hi @SjonHortensius can you tell me in which commit I can find the fix? I'm just curious :-)

Based on the discussion, am I correct that this behaviour is not a problem in itself?

I had missed attestations on my validator, and the error message indicated that the EL was not reachable via HTTP during this time. When I then looked at what Geth was doing during that period, I noticed these error messages.

Can these events, in high numbers, not affect the stability of Geth? From the code base I have seen that they arise from the otherreject counter. Isn't it possible that errors are accumulating here that should perhaps be handled separately? At the moment, all remaining errors that are neither txpool.ErrAlreadyKnown nor txpool.ErrUnderpriced (or txpool.ErrReplaceUnderpriced) end up in otherreject.

			// Track a few interesting failure types
			switch {
			case err == nil: // Noop, but need to handle to not count these

			case errors.Is(err, txpool.ErrAlreadyKnown):
				duplicate++

			case errors.Is(err, txpool.ErrUnderpriced) || errors.Is(err, txpool.ErrReplaceUnderpriced):
				underpriced++

			default:
				otherreject++
			}

I don't have exact knowledge of the possible error codes and I'm not a developer, but as a user I see that Geth has phases where it can no longer be reached, and during this time I see exactly these error messages.

@SjonHortensius
Contributor

@drozdse1 you'll find the fix here: #28356. As you can see in the commit message, these warnings are caused by Erigon sending bad tx announces. Geth complains when receiving those, but it cannot drop the peer, as that would invalidate too many peers and hurt the network. By lowering the verbosity, the Erigon team has time to fix the issue without Geth annoying their users too much.

@karalabe
Member

The stale tx log was also lowered; it should be in the next release.
