Refactor bulk metadata visiting #1180

Closed · wants to merge 9 commits

Conversation

@wks wks (Collaborator) commented Jul 29, 2024

DRAFT:

  • There is still a performance disadvantage compared with the old approach. Find the cause.
  • (Re-)implement bulk update, VO bit searching (for internal pointers) and VO bit scanning (for traversal).

This PR refactors the mechanism for visiting (reading and/or updating) side metadata in bulk. Compared with the existing approach, the new approach

  • uses only one callback for both bit ranges and byte ranges (TODO: try not to use a callback; use an iterator instead)
  • allows arbitrary granularity instead of bytes-only
  • (TODO) allows breaking discontiguous metadata for a contiguous data address range into contiguous metadata sub-ranges. (Do we need it? We only need it for bulk visiting metadata for data ranges larger than 4K.)

Using one callback allows an arbitrary FnMut callback (instead of two Fns) to be used to visit things (such as valid object references via VO bits) in a metadata range. This also removes the use of Cell when finding the last VO bit from an internal pointer.

Now the side metadata can be visited at any granularity, such as usize. We no longer have to do it in two steps, from bits to bytes (iterate_meta_bits) and from bytes to words (find_last_non_zero_bit_in_metadata_bytes). Anything smaller than a grain (a usize, for example) is treated as a "sub-grain range" and visited bit by bit (e.g. for each bit in a usize...).
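As a rough illustration of the mechanism (the names below mirror VisitRange and break_range_callback from this PR, but the signature and field layout are simplified sketches, not the actual API), breaking a bit range into sub-grain and whole-grain pieces and handing every piece to a single FnMut callback might look like this:

```rust
/// Number of bits in one grain; a grain here is a `usize`, but other grain
/// sizes (u8/u16/u32) work the same way.
const GRAIN_BITS: usize = usize::BITS as usize;

/// A piece of a metadata bit range: either a run of whole grains or a
/// sub-grain range of bits inside one grain.
/// (Simplified stand-in for the PR's `VisitRange`; field layout is illustrative.)
enum VisitRange {
    /// Grain indices `[start, end)`, all of which are covered entirely.
    WholeGrains { start: usize, end: usize },
    /// Bits `[bit_start, bit_end)` inside the grain at index `grain`.
    SubGrain { grain: usize, bit_start: usize, bit_end: usize },
}

/// Break the bit range `[start_bit, end_bit)` (global bit indices) into at
/// most one leading sub-grain piece, one run of whole grains, and one
/// trailing sub-grain piece, handing each piece to a single `FnMut` callback.
fn break_range_callback<F: FnMut(VisitRange)>(start_bit: usize, end_bit: usize, mut visit: F) {
    if start_bit >= end_bit {
        return; // empty range
    }
    let start_grain = start_bit / GRAIN_BITS;
    let end_grain = end_bit / GRAIN_BITS;
    if start_grain == end_grain {
        // The whole range lies inside a single grain.
        visit(VisitRange::SubGrain {
            grain: start_grain,
            bit_start: start_bit % GRAIN_BITS,
            bit_end: end_bit % GRAIN_BITS,
        });
        return;
    }
    if start_bit % GRAIN_BITS != 0 {
        // Leading partial grain.
        visit(VisitRange::SubGrain {
            grain: start_grain,
            bit_start: start_bit % GRAIN_BITS,
            bit_end: GRAIN_BITS,
        });
    }
    let first_whole = (start_bit + GRAIN_BITS - 1) / GRAIN_BITS;
    if first_whole < end_grain {
        // Grain-aligned middle: can be visited one whole grain (or one memset) at a time.
        visit(VisitRange::WholeGrains { start: first_whole, end: end_grain });
    }
    if end_bit % GRAIN_BITS != 0 {
        // Trailing partial grain.
        visit(VisitRange::SubGrain {
            grain: end_grain,
            bit_start: 0,
            bit_end: end_bit % GRAIN_BITS,
        });
    }
}
```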

(TODO) Make a generic mechanism to break a contiguous data address range into one or more contiguous metadata ranges, so that the concrete metadata bit traversal mechanism can be expressed as several levels of nested for loops (or nested callbacks).

@k-sareen (Collaborator)

It might interest you to see how Android implements this: https://cs.android.com/android/platform/superproject/main/+/main:art/runtime/gc/accounting/space_bitmap-inl.h;l=103-229?q=visitMarkedRange

They have two functions: VisitMarkedRange, for an arbitrary begin and end, and Walk, for going through the entire bitmap (which can be optimized to just load whole words, skipping all the extra complexity of handling the left and right edges).
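For context, a whole-bitmap walk of that kind (load one word at a time, skip zero words, then peel off the set bits) can be sketched roughly as follows; the function and parameter names here are purely illustrative, not ART's or mmtk-core's actual code:

```rust
/// Walk every set bit of a mark bitmap by loading whole words.
/// `base` is the data address that bit 0 of `bitmap` corresponds to, and
/// `bytes_per_bit` is the amount of data each metadata bit covers.
fn walk_bitmap<F: FnMut(usize)>(bitmap: &[usize], base: usize, bytes_per_bit: usize, mut visit: F) {
    for (word_idx, &word) in bitmap.iter().enumerate() {
        let mut w = word;
        // Empty words are skipped with a single comparison; only non-zero
        // words pay for the per-bit loop below.
        while w != 0 {
            let bit = w.trailing_zeros() as usize;
            let global_bit = word_idx * usize::BITS as usize + bit;
            visit(base + global_bit * bytes_per_bit);
            w &= w - 1; // clear the lowest set bit
        }
    }
}
```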

@wks wks (Collaborator, Author) commented Jul 29, 2024

Here are some benchmark results for bulk zeroing/setting one-bit-per-word metadata (the VO bit) for a 256-byte region (an Immix line). They can be obtained by executing MMTK_BENCH=bzero perf record cargo bench after checking out this branch.

implementation | grain | callback | normalize | fast path | result (ns)
---------------|-------|----------|-----------|-----------|------------
modern         | 64    | false    | false     | false     | 49.505
modern         | 64    | true     | false     | false     | 12.639
modern         | 64    | true     | true      | false     | 14.602
modern         | 64    | true     | true      | true      | 15.868
modern         | 32    | false    | false     | false     | 39.938
modern         | 32    | true     | false     | false     | 6.2234
modern         | 32    | true     | true      | false     | 11.217
modern         | 32    | true     | true      | true      | 5.4676
modern         | 8     | false    | false     | false     | 41.254
modern         | 8     | true     | false     | false     | 6.4790
modern         | 8     | true     | true      | false     | 11.243
modern         | 8     | true     | true      | true      | 4.6617
classic        | 8     | true     | /         | false     | 3.1614
classic        | 8     | true     | /         | true      | 3.3967
  • "classic" means the old iterate_meta_bits function, and "modern" implementations use the new grain::break_range and grain::break_range_callback functions introduced in this PR.
  • "grain" is the granularity, i.e whether we read/write 8/16/32/64 bits at a time.
  • "callback" is true means using grain::break_range_callback. It visits bit/byte ranges using a callback impl FnMut(VisitRange). The other implementation is grain::break_range which returns a Vec<VisitRange>.
  • "normalize" means normalizing the input BitAddress arguments so that the byte address is aligned to the grain size. It is incorrect in general if we don't normalize, unless the input is already normalized. But it shows the cost of normalizing the BitAddress.
  • "fast path" implementations tests in the beginning of the function whether the input bit addresses are already grain-aligned.

The results show that

  • Allocating a Vec for every 256 bytes of data address is definitely overkill and must be avoided. Callback-based implementations avoid that allocation, but need to handle "stopping iteration" internally, which increases complexity.
  • The cost of normalizing is surprisingly high. It may stand out because the actual metadata being set/cleared is only a 32-bit-aligned region.
  • The fast path is effective for the "modern" implementations, but makes the "classic" implementation slightly slower. Maybe that's because the "classic" implementation already checks whether the range covers whole bytes as its second check (right after checking for empty ranges), or maybe the slowdown is an artifact of compiler optimizations.
  • Because only 32 bits of metadata are involved, accessing them at 64-bit granularity is slower: the range covers only part of a grain, so it needs an atomic read-modify-write operation.
  • In the "modern" implementation, accessing at 32-bit granularity has no advantage over accessing at 8-bit granularity. Maybe that's because the underlying implementation is always memset.

From the results, the bottleneck of setting/zeroing a small range of metadata is the preparation work, such as figuring out the bit ranges and byte ranges and ensuring the range is grain-aligned. The actual bit-zeroing and bit-setting is very cheap. Also, because we often bulk-set and bulk-clear aligned ranges (such as Immix lines), it is important that such use cases are covered by fast paths; otherwise the redundant checks will dominate.

Normalizing is necessary because the input is a byte address and an in-byte bit offset obtained from address_to_meta_address and meta_byte_lshift. It may be beneficial to refactor address_to_meta_address and meta_byte_lshift so that they return grain-aligned byte addresses and in-grain offsets instead.
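For illustration, normalizing a (byte address, in-byte bit offset) pair to a grain-aligned byte address plus an in-grain bit offset might look like the following; BitAddress here is a simplified stand-in for the PR's type:

```rust
/// A bit position expressed as a byte address plus an in-byte bit offset,
/// which is what `address_to_meta_address` and `meta_byte_lshift` give us.
#[derive(Clone, Copy)]
struct BitAddress {
    byte_addr: usize,
    bit_offset: usize, // 0..8
}

/// Normalize so that `byte_addr` is aligned to the grain size (in bytes,
/// assumed to be a power of two) and the offset is counted in bits within
/// that grain. This is the extra work the "normalize" column above pays for.
fn normalize(addr: BitAddress, grain_bytes: usize) -> BitAddress {
    let aligned = addr.byte_addr & !(grain_bytes - 1);
    let extra_bytes = addr.byte_addr - aligned;
    BitAddress {
        byte_addr: aligned,
        bit_offset: extra_bytes * 8 + addr.bit_offset, // now 0..(grain_bytes * 8)
    }
}
```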

@qinsoon qinsoon (Member) commented Jul 31, 2024

If you would like to push further with this PR, you would need to show that the refactoring results in an improvement in favorable cases and does not cause slowdowns in other cases. Otherwise it is not worth introducing the complexity.

@wks wks (Collaborator, Author) commented Aug 6, 2024

Since this is too complicated and doesn't perform well, I am closing this PR in favor of #1181. I'll continue refactoring as needed in the future, bit by bit.

@wks wks closed this Aug 6, 2024