[Messages] Sync job #285

Open
wants to merge 3 commits into dev

Conversation

odesenfans
Contributor

Added a new job that synchronizes unconfirmed messages across
the network. The goal of this job is to re-send messages missed
by nodes with the ability to push data on-chain. This can happen
because of various issues like server downtime or bugs.

This job works in three parts:

  • the publisher task periodically sends the list of all the messages
    older than the last TX block that have yet to be confirmed by
    on-chain data.
  • the receiver task stores the list of unconfirmed messages for
    each peer detected on the network.
  • the sync/aggregator task aggregates the confirmation data from
    all the nodes and fetches messages using the HTTP API.
    These messages are added to the pending message queue.
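
As a rough illustration, the sync/aggregator part could look like the sketch below. All names here (`unconfirmed_hashes_by_peer`, `fetch_message`, `add_pending_message`, the endpoint path, the job period constant) are illustrative assumptions, not the actual implementation:

```python
import asyncio
from typing import Awaitable, Callable, Dict, Set

import aiohttp

JOB_PERIOD = 300  # seconds; matches the 5-minute period mentioned below

# Hypothetical in-memory store filled by the receiver task:
# peer ID -> set of message hashes that peer still reports as unconfirmed.
unconfirmed_hashes_by_peer: Dict[str, Set[str]] = {}


async def fetch_message(session: aiohttp.ClientSession, api_url: str, item_hash: str) -> dict:
    # Illustrative only: fetch the full message from a peer's HTTP API.
    async with session.get(f"{api_url}/api/v0/messages/{item_hash}") as resp:
        resp.raise_for_status()
        return await resp.json()


async def sync_aggregator_task(
    peer_apis: Dict[str, str],
    add_pending_message: Callable[[dict], Awaitable[None]],
) -> None:
    """Aggregate unconfirmed hashes from all peers and queue the missing messages."""
    async with aiohttp.ClientSession() as session:
        while True:
            # Union of everything any peer still considers unconfirmed.
            all_hashes: Set[str] = set()
            for hashes in unconfirmed_hashes_by_peer.values():
                all_hashes |= hashes

            for item_hash in all_hashes:
                for peer_id, api_url in peer_apis.items():
                    try:
                        message = await fetch_message(session, api_url, item_hash)
                    except aiohttp.ClientError:
                        continue  # this peer cannot serve the message, try the next one
                    await add_pending_message(message)
                    break

            await asyncio.sleep(JOB_PERIOD)
```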

This solution is less expensive than constantly sharing all
the messages across all the nodes, and it guarantees that the network
will eventually be synchronized as long as the on-chain data
synchronization jobs are working. With the current implementation,
a message can remain out of sync at most until a new TX
is published on-chain plus the job period (currently 5 minutes).

Fixed an issue where the pending message job would block on
the final messages in the queue and stop processing newer messages.

Once the job finishes the loop on all the messages in the pending
message collection, the previous implementation waits until all
the message tasks finish. This can delay the node by several hours
before it is able to process newer pending messages again.
Messages end up being processed, but far later than expected.

The issue arises because we never remove messages from the pending
queue if we fail to retrieve the associated content, so the job
always has messages left in the queue to wait on.

Fixed the issue by allowing the loop to restart without waiting
for all messages to be processed. We now compute an individual ID
for each pending message and add it to a set. The job simply
ignores any message that is already being processed, allowing
newer messages to be taken into account.
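
As a rough sketch of the fix (names like `pending_message_id`, `fetch_pending` and `handle_pending` are assumptions, not the actual code), the loop can track the set of IDs currently being processed and skip them instead of waiting for every task to finish before restarting:

```python
import asyncio
from typing import Set


def pending_message_id(pending: dict) -> str:
    # Hypothetical ID: combine the message hash with the source that queued it,
    # so that the same pending entry always maps to the same identifier.
    return f"{pending['message']['item_hash']}:{pending.get('source', '')}"


async def process_pending_messages(fetch_pending, handle_pending) -> None:
    in_flight: Set[str] = set()  # IDs of messages currently being processed

    while True:
        for pending in await fetch_pending():
            message_id = pending_message_id(pending)
            if message_id in in_flight:
                # Already being handled (e.g. its content is still unavailable);
                # skip it so the loop can move on to newer messages.
                continue

            in_flight.add(message_id)
            task = asyncio.create_task(handle_pending(pending))
            # Forget the ID once the task completes, whatever the outcome,
            # so the message can be retried on a later pass.
            task.add_done_callback(lambda _t, mid=message_id: in_flight.discard(mid))

        # Restart the loop without awaiting the individual tasks.
        await asyncio.sleep(1)
```

The point is that the loop never waits for the per-message tasks to finish; it only remembers which messages are already in flight so they are not queued twice.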