Mas i994 handoffsync #995

martinsumner · 2023-01-03T13:14:07Z

Refactor handoff to avoid TCP RECV error - see #994.

Clarify meaning of terms:
- handoff_timeout if the receiver does not respond to a sender message within this period, then the sender should assume the receiver is down and exit the fold. The sender can exchange sync messages every handoff_acksync_threshold batches to ensure that batches do not accumulate at the receiver causing the sender to fail to get a timely response to a future request.
- handoff_receive_timeout if the receiver does not receive a message from the sender within this timeout, it will assume that the sender is down and exit the process. The sender will interrupt the fold if it has not sent a batch in the last handoff_receive_timeout div 3 to send a sync message as a keepalive to prevent this timeout if it is alive.
The default value of handoff_acksync_threshold is changed to 1 from 25. At 25 batches, and 250 bytes every object size, a backlog of 100K handoff items could develop at the receiver - requiring a sub-1ms response time for each handoff item to avoid triggering the timeout. Alternative default of 1 is less likely to see the timeout, at the small cost of an additional sync round-trip per batch.
A handoff_batch_threshold_count has been added to generate a threshold for a batch when this count of objects is reached. The handoff_batch_threshold is also respected (which triggers a batch on size in bytes). This is to prevent very large batches by count when object sizes are small.
The ?PT_MSG_OLDSYNC is now properly deprecated. The receiver will still respond, should a cluster contain a handoff sender still on a previous version, but the sender will now only send the ?PT_MSG_SYNC message.
Reformatting of code to fit column width.
Standardise handling of sync messages through riakc_core_handoff_sender:send_sync/3 function, and now error logs raised at the actual message where an error is received. The item queue is redacted in the error generated by a throw, so that important details are not obfuscated by a large list of binaries.
A new log has been added to log progress in sending batches every handoff_acklog_threshold batches (default 100). This allows for progress to be tracked through log indexers, rather than manually by operators calling the riak admin transfers command.
All code for (legacy) non-batched handoff removed.

Remove legacy code from handoff_sender. The handoff_receiver must keep unused references, as during an upgrade an updated node could be a receiver to a non-updated node sending. For non-batch not such an issue, as all supported versions for upgrade already supported batch. However, receiver must still indicate it supports batch due to above problem. All handoff receiver/sender code tidied down to 80-column width

Do away with timer based sync, and ack-sync based on threshold only. Also log ongoing transfer progress every ack-log threshold. Log at point of error reason for error - avoid generic {shutdown, timeout} error with no clue as to actual point of code origin.

Fix issue that configure message does not respond sync, and so another sync is now required.

Confusing that the first log on the sending of a batch will indicate a batch_count of 100 not 0

Clarify log text, as batch_size no longer fixed

Need to distinguish between failed fold and slow fold - and so the keepalive of the receiver has value. Now implemented by checking a keepalive_next time every visit item, rather than continuously entering and exiting selective receive

martinsumner · 2023-01-04T09:32:30Z

basho/riak_test#1369

Make a metadata exchange part of the join process. This prevents the situation where a bucket type is active in a cluster, then a node joins (as part of cluster expansion, say), but the bucket properties are not known to joining node during handoff of objects of that type. Now, the join cannot be staged without a metadata exchange, so that all joining nodes know of cluster metadata (e.g. bucket types) before the join is committed and handoffs start.

It is only an attempt - failure (i.e. timeout) would be no different to the current state with a potential race, so we don't block joins. Joins will normally be safer because of this.

* Initial tidy Remove legacy code from handoff_sender. The handoff_receiver must keep unused references, as during an upgrade an updated node could be a receiver to a non-updated node sending. For non-batch not such an issue, as all supported versions for upgrade already supported batch. However, receiver must still indicate it supports batch due to above problem. All handoff receiver/sender code tidied down to 80-column width * Change AckSync to every batch by default Do away with timer based sync, and ack-sync based on threshold only. Also log ongoing transfer progress every ack-log threshold. Log at point of error reason for error - avoid generic {shutdown, timeout} error with no clue as to actual point of code origin. * Standardise send_sync into function Fix issue that configure message does not respond sync, and so another sync is now required. * Receiver needs vnode module not master * Make first log indicate batch_count of 0 Confusing that the first log on the sending of a batch will indicate a batch_count of 100 not 0 * Batch threshold can be either count or size Clarify log text, as batch_size no longer fixed * Reinstate keepalive of receiver Need to distinguish between failed fold and slow fold - and so the keepalive of the receiver has value. Now implemented by checking a keepalive_next time every visit item, rather than continuously entering and exiting selective receive * Further comments * Update after review * Metadata exchange on join Make a metadata exchange part of the join process. This prevents the situation where a bucket type is active in a cluster, then a node joins (as part of cluster expansion, say), but the bucket properties are not known to joining node during handoff of objects of that type. Now, the join cannot be staged without a metadata exchange, so that all joining nodes know of cluster metadata (e.g. bucket types) before the join is committed and handoffs start. * Attempt exchange on Join It is only an attempt - failure (i.e. timeout) would be no different to the current state with a potential race, so we don't block joins. Joins will normally be safer because of this.

martinsumner added 8 commits January 2, 2023 21:15

Standardise send_sync into function

2dffde1

Fix issue that configure message does not respond sync, and so another sync is now required.

Receiver needs vnode module not master

34ec1a1

Make first log indicate batch_count of 0

d3b8dd2

Confusing that the first log on the sending of a batch will indicate a batch_count of 100 not 0

Batch threshold can be either count or size

11fadfc

Clarify log text, as batch_size no longer fixed

Reinstate keepalive of receiver

25ae3dd

Need to distinguish between failed fold and slow fold - and so the keepalive of the receiver has value. Now implemented by checking a keepalive_next time every visit item, rather than continuously entering and exiting selective receive

Further comments

8c7eb17

martinsumner mentioned this pull request Jan 4, 2023

Handoff and TCP RECV timeout #994

Closed

Update after review

882353a

yorkshire-steve approved these changes Jan 16, 2023

View reviewed changes

martinsumner added 2 commits January 17, 2023 11:39

Attempt exchange on Join

0b26d1a

It is only an attempt - failure (i.e. timeout) would be no different to the current state with a potential race, so we don't block joins. Joins will normally be safer because of this.

yorkshire-steve approved these changes Jan 18, 2023

View reviewed changes

martinsumner merged commit ad546ed into develop-3.0 Jan 18, 2023

martinsumner mentioned this pull request Feb 27, 2023

Handoff key filtering (repair or ring resize handoff) does not send periodic sync messages to keep TCP connection alive #557

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Mas i994 handoffsync #995

Mas i994 handoffsync #995

Uh oh!

martinsumner commented Jan 3, 2023 •

edited

Loading

Uh oh!

martinsumner commented Jan 4, 2023

Uh oh!

Uh oh!

Mas i994 handoffsync #995

Mas i994 handoffsync #995

Uh oh!

Conversation

martinsumner commented Jan 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martinsumner commented Jan 4, 2023

Uh oh!

Uh oh!

martinsumner commented Jan 3, 2023 •

edited

Loading