
Parameter for a threshold value to remove peers that are delivering stale transactions or generally misbehave #28345

Closed
drozdse1 opened this issue Oct 15, 2023 · 8 comments

Comments

@drozdse1

Rationale

After updating my client, I often see the following entries in my Geth log file:

WARN [10-15|13:31:24.568] Peer delivering stale transactions       peer=5cfee384dfdda5892f9d64f651ead855f8e0fb11e80b39a13274d47d2fb719a9 rejected=41
WARN [10-15|13:31:24.789] Peer delivering stale transactions       peer=5cfee384dfdda5892f9d64f651ead855f8e0fb11e80b39a13274d47d2fb719a9 rejected=65

...

WARN [10-15|13:31:32.312] Peer delivering stale transactions       peer=536a69c3ffea7e8201c15b170996b3f305a830cbd2f65dcfb135de0cb96d75fc rejected=128
WARN [10-15|13:31:33.006] Peer delivering stale transactions       peer=a4851bd6344c4b770a0a720e6b7d21ced3681d11d07ab91b3396766fa2fe6713 rejected=128

For now I remove the offending peer manually through the console:

admin.removePeer()
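
If it helps, here is a minimal Go sketch that performs the same removal over the admin RPC namespace instead of the interactive console. Assumptions: the admin API is exposed on the node's IPC endpoint at /var/lib/goethereum/geth.ipc, and the peer ID is the one taken from the log above.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ethereum/go-ethereum/rpc"
)

// peerInfo mirrors only the admin_peers fields needed here.
type peerInfo struct {
	ID    string `json:"id"`
	Enode string `json:"enode"`
}

func main() {
	// Peer ID copied from the "Peer delivering stale transactions" log line.
	staleID := "5cfee384dfdda5892f9d64f651ead855f8e0fb11e80b39a13274d47d2fb719a9"

	// Assumed IPC path; adjust to your --datadir.
	client, err := rpc.DialContext(context.Background(), "/var/lib/goethereum/geth.ipc")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	var peers []peerInfo
	if err := client.Call(&peers, "admin_peers"); err != nil {
		log.Fatal(err)
	}
	for _, p := range peers {
		if p.ID == staleID {
			var removed bool
			if err := client.Call(&removed, "admin_removePeer", p.Enode); err != nil {
				log.Fatal(err)
			}
			fmt.Printf("removed %s: %v\n", staleID, removed)
		}
	}
}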

Implementation

It would be very helpful to have a parameter specifying the maximum number of misbehaviours per peer; once this is exceeded, the corresponding peer would be removed automatically or blocked for a specific period.

The misbehaviour could, of course, also cover other typical problems that produce warnings in the log file or that generally endanger the stability of the Geth application.
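
To make the request more concrete, below is a purely illustrative Go sketch of the kind of bookkeeping such a parameter could control. None of these names (misbehaviourTracker, maxMisbehaviours, banPeriod) exist in Geth; they are hypothetical and only show the intended behaviour: count misbehaviours per peer, and once a threshold is exceeded, drop the peer and ignore it for a while.

package main

import (
	"fmt"
	"sync"
	"time"
)

// misbehaviourTracker counts misbehaviours per peer and reports when a peer
// should be dropped and temporarily banned. Hypothetical, not Geth code.
type misbehaviourTracker struct {
	mu               sync.Mutex
	counts           map[string]int
	bannedUntil      map[string]time.Time
	maxMisbehaviours int           // hypothetical threshold parameter
	banPeriod        time.Duration // hypothetical ban duration parameter
}

func newMisbehaviourTracker(max int, ban time.Duration) *misbehaviourTracker {
	return &misbehaviourTracker{
		counts:           make(map[string]int),
		bannedUntil:      make(map[string]time.Time),
		maxMisbehaviours: max,
		banPeriod:        ban,
	}
}

// record notes one misbehaviour (e.g. a "Peer delivering stale transactions"
// warning) for the given peer ID and returns true when the peer has crossed
// the threshold and should be removed and banned for banPeriod.
func (t *misbehaviourTracker) record(peerID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()

	t.counts[peerID]++
	if t.counts[peerID] >= t.maxMisbehaviours {
		t.counts[peerID] = 0
		t.bannedUntil[peerID] = time.Now().Add(t.banPeriod)
		return true
	}
	return false
}

// banned reports whether a peer is still inside its ban window.
func (t *misbehaviourTracker) banned(peerID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return time.Now().Before(t.bannedUntil[peerID])
}

func main() {
	t := newMisbehaviourTracker(20, 10*time.Minute)
	peer := "5cfee384dfdda5892f9d64f651ead855f8e0fb11e80b39a13274d47d2fb719a9"
	for i := 0; i < 20; i++ {
		if t.record(peer) {
			fmt.Println("threshold exceeded, remove and ban peer", peer)
		}
	}
	fmt.Println("still banned:", t.banned(peer))
}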

@drozdse1
Author

drozdse1 commented Oct 16, 2023

I have created a bash script that does something similar to what I requested.

#!/bin/bash

container_name="eth-docker-execution-1"
desired_event_pattern="Peer delivering stale transactions"
threshold=20  # Number of consecutive occurrences

declare -A peer_counts

# Function to perform the action
perform_action() {
  local peer_id="$1"
  local count=${peer_counts[$peer_id]}
  echo "Performing action for peer ID: $peer_id since '$desired_event_pattern' occurred $count times consecutively."

  # Add your action here, such as notification commands or script execution.
  # Execute the docker exec command to remove the peer
  docker exec -t "$container_name" geth attach --datadir /var/lib/goethereum --exec "admin.removePeer(admin.peers.find((item) => item.id == '$peer_id').enode)"

  peer_counts["$peer_id"]=0  # Reset the counter for this peer
}

# Monitor the Docker container logs
docker logs "$container_name" -f --since 1m 2>&1 | while read -r line
do
  if [[ "$line" =~ $desired_event_pattern ]]; then
    # Extract the peer ID from the log message
    peer_id=$(echo "$line" | grep -oP 'peer=\K[^ ]+')
    echo "Peer ID: $peer_id"

    if [ -z "${peer_counts[$peer_id]}" ]; then
      peer_counts["$peer_id"]=1
    else
      ((peer_counts["$peer_id"]++))
    fi

    if [ "${peer_counts[$peer_id]}" -ge "$threshold" ]; then
      perform_action "$peer_id"
      peer_counts["$peer_id"]=0  # Reset the counter for this peer
    fi
  else
    # When the event pattern is not found, decrease the counter for all observed peer IDs.
    # The regular INFO logs for newly created blocks serve as a natural clock to decrease the counter again.
    for observed_peer_id in "${!peer_counts[@]}"; do
      if [ -n "${peer_counts[$observed_peer_id]}" ]; then
        ((peer_counts["$observed_peer_id"]--))
        if [ "${peer_counts[$observed_peer_id]}" -le 0 ]; then
          unset peer_counts["$observed_peer_id"]  # Remove peer ID from observation if count reaches or goes below 0
        fi
      fi
    done
  fi
done

@lightclient
Member

I think an issue with disconnecting from a peer that is delivering stale transactions is that the transactions may not yet be stale to them if they are behind the chain for any reason, and disconnecting could exacerbate the problem. Sleeping a bit before engaging the peer further seems appropriate.
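
A rough sketch of that back-off idea, purely illustrative and not Geth code; the function name and the 128s cap are made up:

package main

import (
	"fmt"
	"time"
)

// backoff returns how long to wait before asking a peer for more transactions
// after it has delivered stale ones `strikes` times in a row, rather than
// disconnecting it. The delay doubles per strike and tops out at 128s.
func backoff(strikes int) time.Duration {
	if strikes < 0 {
		strikes = 0
	}
	if strikes > 7 {
		strikes = 7
	}
	return time.Second << uint(strikes)
}

func main() {
	for _, s := range []int{0, 3, 10} {
		fmt.Printf("strikes=%d -> wait %v\n", s, backoff(s))
	}
}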

@SjonHortensius
Contributor

I have similar issues. I see ~15K/hour of the supposed Peer delivering stale transactions warnings (which I'm not able to fix) and ~25K/hour of Announced transaction size mismatch messages.

I realize this is triggered by external factors, but it seems excessive for something I cannot control.

@lightclient
Member

@SjonHortensius thanks for the report. The log level is a slightly different issue from modifying the actual handling of the peer. I think downgrading the error could make sense.

@karalabe
Member

@SjonHortensius @lightclient The announced tx issue has been "fixed" in the latest release.

@drozdse1
Author

Hi @SjonHortensius can you tell me in which commit I can find the fix? I'm just curious :-)

Based on the discussion, am I correct that this behaviour is not a problem in itself?

I had missed attestations on my validator, and the error message indicated that the EL was not reachable via HTTP during this time. When I then looked at what Geth was doing during that period, I noticed these error messages.

Can these events, in high numbers, not affect the stability of Geth? From the code base I have seen that they arise from the otherreject counter. Isn't it possible that errors are accumulating here that should perhaps be handled separately? At the moment, all remaining errors that are neither txpool.ErrAlreadyKnown nor txpool.ErrUnderpriced (or txpool.ErrReplaceUnderpriced) end up in otherreject.

			// Track a few interesting failure types
			switch {
			case err == nil: // Noop, but need to handle to not count these

			case errors.Is(err, txpool.ErrAlreadyKnown):
				duplicate++

			case errors.Is(err, txpool.ErrUnderpriced) || errors.Is(err, txpool.ErrReplaceUnderpriced):
				underpriced++

			default:
				otherreject++
			}

I don't have exact knowledge of the possible error codes and I'm not a developer, but as a user I see that Geth has phases where it can no longer be reached, and during this time I see exactly these error messages.

@SjonHortensius
Contributor

@drozdse1 you'll find the fix here: #28356. As you can see in the commit message, these warnings are caused by Erigon sending bad tx announces. Geth complains when receiving those, but it cannot drop the peer, as that would invalidate too many peers and hurt the network. By lowering the verbosity, the Erigon team has time to fix the issue without Geth annoying their users too much.

@karalabe
Member

The stale tx log was also lowered; it should be in the next release.
