
Telemetry core: "Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217030)" #501

Open
@jsdw


At some point recently, telemetry.polkadot.io went down with lots of errors like:

2022-09-30 10:33:26,536 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174701)
2022-09-30 10:33:26,538 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(1)/ShardNodeId(217267)
2022-09-30 10:33:26,905 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(1)/ShardNodeId(217346)
2022-09-30 10:33:27,001 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174702)
2022-09-30 10:33:27,001 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174702)
2022-09-30 10:33:27,070 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217363)
2022-09-30 10:33:27,070 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217363)
2022-09-30 10:33:27,202 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217364)
2022-09-30 10:33:27,204 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217364)
2022-09-30 10:33:27,834 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174703)
2022-09-30 10:33:27,834 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174703)
2022-09-30 10:33:28,577 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174704)
2022-09-30 10:33:28,577 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174704)
2022-09-30 10:33:28,680 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217030)
2022-09-30 10:33:29,421 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(216564)
2022-09-30 10:33:29,458 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217031)
2022-09-30 10:33:29,458 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217031)
^C

Restarting the telemetry-core pod didn't help.
Restarting the shards made things work again.

These errors imply that shards were sending information about nodes that the core knew nothing about.

Is there a chance that the core was restarted at some point (perhaps due to running out of memory or whatnot) and the shards didn't handle this properly by re-sending their node information?
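The restart scenario above can be sketched roughly as follows. This is not the actual telemetry code: `Core`, `FromShard`, `NodeId` and `update_after_core_restart` are hypothetical names invented for illustration, and the only things taken from the logs are the `ConnId`/`ShardNodeId` identifiers and the error text. The assumption is that the core maps each (connection, shard-local node ID) pair to its own ID when a node is added, so an update for a node it was never told about has nothing to look up.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the identifiers that appear in the log lines.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ConnId(u64);
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ShardNodeId(u64);
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NodeId(u64);

// Hypothetical messages a shard might send to the core.
enum FromShard {
    AddNode { local_id: ShardNodeId },
    UpdateNode { local_id: ShardNodeId },
}

// Hypothetical core state: global node IDs keyed by (connection, shard-local ID).
#[derive(Default)]
struct Core {
    next_id: u64,
    node_ids: HashMap<(ConnId, ShardNodeId), NodeId>,
}

impl Core {
    fn handle(&mut self, conn: ConnId, msg: FromShard) -> Result<NodeId, String> {
        match msg {
            // A node is only known to the core once the shard announces it.
            FromShard::AddNode { local_id } => {
                let id = NodeId(self.next_id);
                self.next_id += 1;
                self.node_ids.insert((conn, local_id), id);
                Ok(id)
            }
            // Updates for unannounced nodes fail the lookup.
            FromShard::UpdateNode { local_id } => self
                .node_ids
                .get(&(conn, local_id))
                .copied()
                .ok_or_else(|| {
                    format!(
                        "Cannot find ID for node with shard/connectionId of {:?}/{:?}",
                        conn, local_id
                    )
                }),
        }
    }
}

// The shard announces a node, the core "restarts" (its state is lost), and
// the shard carries on sending updates without re-announcing the node.
fn update_after_core_restart() -> Result<NodeId, String> {
    let conn = ConnId(3);
    let node = ShardNodeId(217030);

    let mut core = Core::default();
    core.handle(conn, FromShard::AddNode { local_id: node })?;
    core.handle(conn, FromShard::UpdateNode { local_id: node })?;

    // Simulated restart: a fresh core with empty state.
    let mut core = Core::default();
    core.handle(conn, FromShard::UpdateNode { local_id: node })
}
```

In this sketch the final update returns an `Err` carrying the same message as the log lines above. That would also be consistent with restarting the shards fixing things, if a restarted shard announces all of its nodes again.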

Alternatively, is it possible that the connection between core and shards faltered and the core didn't properly clean up its internal state when this happened? (Offhand, I can't see anything that would drop all of a shard's nodes in the core when that shard's connection was lost.)
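For reference, the kind of cleanup I'd expect to find would look something like the sketch below. Again, this is purely illustrative and assumes the core keys its nodes by (connection, shard-local node ID); `CoreState`, `on_shard_disconnected` and `disconnect_demo` are made-up names, not anything in the codebase.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the identifiers that appear in the log lines.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ConnId(u64);
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ShardNodeId(u64);
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NodeId(u64);

// Hypothetical core state: global node IDs keyed by (connection, shard-local ID).
#[derive(Default)]
struct CoreState {
    node_ids: HashMap<(ConnId, ShardNodeId), NodeId>,
}

impl CoreState {
    /// Drop every node that arrived over `conn` and return how many were removed.
    fn on_shard_disconnected(&mut self, conn: ConnId) -> usize {
        let before = self.node_ids.len();
        self.node_ids.retain(|key, _| key.0 != conn);
        before - self.node_ids.len()
    }
}

// Two nodes on connection 1 and one on connection 5; connection 1 drops.
fn disconnect_demo() -> (usize, usize) {
    let mut state = CoreState::default();
    state.node_ids.insert((ConnId(1), ShardNodeId(217267)), NodeId(0));
    state.node_ids.insert((ConnId(1), ShardNodeId(217346)), NodeId(1));
    state.node_ids.insert((ConnId(5), ShardNodeId(174701)), NodeId(2));

    let removed = state.on_shard_disconnected(ConnId(1));
    (removed, state.node_ids.len())
}
```

If nothing equivalent runs when a shard connection drops, entries for that connection would linger in the core.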

The latter is also a little harder to test locally (we'll have tested restarting shards and core plenty). Perhaps #497 also arose from a connection issue like this, one that led to duplicates not being cleaned up?
