Skip to content

Mas i994 handoffsync #995

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jan 18, 2023
Merged

Mas i994 handoffsync #995

merged 11 commits into from
Jan 18, 2023

Conversation

martinsumner
Copy link
Contributor

@martinsumner martinsumner commented Jan 3, 2023

Refactor handoff to avoid TCP RECV error - see #994.

  • Clarify meaning of terms:

    • handoff_timeout if the receiver does not respond to a sender message within this period, then the sender should assume the receiver is down and exit the fold. The sender can exchange sync messages every handoff_acksync_threshold batches to ensure that batches do not accumulate at the receiver causing the sender to fail to get a timely response to a future request.

    • handoff_receive_timeout if the receiver does not receive a message from the sender within this timeout, it will assume that the sender is down and exit the process. The sender will interrupt the fold if it has not sent a batch in the last handoff_receive_timeout div 3 to send a sync message as a keepalive to prevent this timeout if it is alive.

  • The default value of handoff_acksync_threshold is changed to 1 from 25. At 25 batches, and 250 bytes every object size, a backlog of 100K handoff items could develop at the receiver - requiring a sub-1ms response time for each handoff item to avoid triggering the timeout. Alternative default of 1 is less likely to see the timeout, at the small cost of an additional sync round-trip per batch.

  • A handoff_batch_threshold_count has been added to generate a threshold for a batch when this count of objects is reached. The handoff_batch_threshold is also respected (which triggers a batch on size in bytes). This is to prevent very large batches by count when object sizes are small.

  • The ?PT_MSG_OLDSYNC is now properly deprecated. The receiver will still respond, should a cluster contain a handoff sender still on a previous version, but the sender will now only send the ?PT_MSG_SYNC message.

  • Reformatting of code to fit column width.

  • Standardise handling of sync messages through riakc_core_handoff_sender:send_sync/3 function, and now error logs raised at the actual message where an error is received. The item queue is redacted in the error generated by a throw, so that important details are not obfuscated by a large list of binaries.

  • A new log has been added to log progress in sending batches every handoff_acklog_threshold batches (default 100). This allows for progress to be tracked through log indexers, rather than manually by operators calling the riak admin transfers command.

  • All code for (legacy) non-batched handoff removed.

Remove legacy code from handoff_sender.  The handoff_receiver must keep unused references, as during an upgrade an updated node could be a receiver to a non-updated node sending.

For non-batch not such an issue, as all supported versions for upgrade already supported batch.  However, receiver must still indicate it supports batch due to above problem.

All handoff receiver/sender code tidied down to 80-column width
Do away with timer based sync, and ack-sync based on threshold only.

Also log ongoing transfer progress every ack-log threshold.

Log at point of error reason for error - avoid generic {shutdown, timeout} error with no clue as to actual point of code origin.
Fix issue that configure message does not respond sync, and so another sync is now required.
Confusing that the first log on the sending of a batch will indicate a batch_count of 100 not 0
Clarify log text, as batch_size no longer fixed
Need to distinguish between failed fold and slow fold - and so the keepalive of the receiver has value.

Now implemented by checking a keepalive_next time every visit item, rather than continuously entering and exiting selective receive
@martinsumner
Copy link
Contributor Author

basho/riak_test#1369

Make a metadata exchange part of the join process.  This prevents the situation where a bucket type is active in a cluster, then a node joins (as part of cluster expansion, say), but the bucket properties are not known to joining node during handoff of  objects of that type.

Now, the join cannot be staged without a metadata exchange, so that all joining nodes know of cluster metadata (e.g. bucket types) before the join is committed and handoffs start.
It is only an attempt - failure (i.e. timeout) would be no different to the current state with a potential race, so we don't block joins.  Joins will normally be safer because of this.
@martinsumner martinsumner merged commit ad546ed into develop-3.0 Jan 18, 2023
martinsumner added a commit that referenced this pull request Feb 1, 2023
* Initial tidy

Remove legacy code from handoff_sender.  The handoff_receiver must keep unused references, as during an upgrade an updated node could be a receiver to a non-updated node sending.

For non-batch not such an issue, as all supported versions for upgrade already supported batch.  However, receiver must still indicate it supports batch due to above problem.

All handoff receiver/sender code tidied down to 80-column width

* Change AckSync to every batch by default

Do away with timer based sync, and ack-sync based on threshold only.

Also log ongoing transfer progress every ack-log threshold.

Log at point of error reason for error - avoid generic {shutdown, timeout} error with no clue as to actual point of code origin.

* Standardise send_sync into function

Fix issue that configure message does not respond sync, and so another sync is now required.

* Receiver needs vnode module not master

* Make first log indicate batch_count of 0

Confusing that the first log on the sending of a batch will indicate a batch_count of 100 not 0

* Batch threshold can be either count or size

Clarify log text, as batch_size no longer fixed

* Reinstate keepalive of receiver

Need to distinguish between failed fold and slow fold - and so the keepalive of the receiver has value.

Now implemented by checking a keepalive_next time every visit item, rather than continuously entering and exiting selective receive

* Further comments

* Update after review

* Metadata exchange on join

Make a metadata exchange part of the join process.  This prevents the situation where a bucket type is active in a cluster, then a node joins (as part of cluster expansion, say), but the bucket properties are not known to joining node during handoff of  objects of that type.

Now, the join cannot be staged without a metadata exchange, so that all joining nodes know of cluster metadata (e.g. bucket types) before the join is committed and handoffs start.

* Attempt exchange on Join

It is only an attempt - failure (i.e. timeout) would be no different to the current state with a potential race, so we don't block joins.  Joins will normally be safer because of this.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants