Refactor bulk metadata visiting #1180

Closed · wants to merge 9 commits

Conversation

@wks wks (Collaborator) commented Jul 29, 2024

DRAFT:

  • There is still a performance disadvantage compared with the old approach. Find the cause.
  • (Re-)implement bulk update, VO bit searching (for internal pointers) and VO bit scanning (for traversal).

This PR refactors the mechanism for visiting (reading and/or updating) side metadata in bulk. Compared with the existing approach, the new approach

  • uses only one callback for both bit ranges and byte ranges (TODO: try not to use a callback; use an iterator instead)
  • allows arbitrary granularity instead of bytes-only
  • (TODO) allows breaking discontiguous metadata for a contiguous data address range into contiguous metadata sub-ranges. (Do we need it? We only need it for bulk visiting metadata for data ranges larger than 4K.)

Using one callback allows an arbitrary FnMut callback (instead of two Fns) to be used to visit things (such as valid object references via VO bits) in a metadata range. This also removes the use of Cell when finding the last VO bit from an internal pointer.

Now the side metadata can be visited at any granularity, such as usize. We no longer have to do it in two steps, from bits to bytes (iterate_meta_bits) and from bytes to words (find_last_non_zero_bit_in_metadata_bytes). Anything smaller than a grain (a usize, for example) is treated as a "sub-grain range" and visited bit by bit (e.g. for each bit in a usize...).
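As a rough illustration of the mechanism (the names below mirror VisitRange and break_range_callback from this PR, but the signature and field layout are simplified sketches, not the actual API), breaking a bit range into sub-grain and whole-grain pieces and handing every piece to a single FnMut callback might look like this:

```rust
/// Number of bits in one grain; a grain here is a `usize`, but other grain
/// sizes (u8/u16/u32) work the same way.
const GRAIN_BITS: usize = usize::BITS as usize;

/// A piece of a metadata bit range: either a run of whole grains or a
/// sub-grain range of bits inside one grain.
/// (Simplified stand-in for the PR's `VisitRange`; field layout is illustrative.)
enum VisitRange {
    /// Grain indices `[start, end)`, all of which are covered entirely.
    WholeGrains { start: usize, end: usize },
    /// Bits `[bit_start, bit_end)` inside the grain at index `grain`.
    SubGrain { grain: usize, bit_start: usize, bit_end: usize },
}

/// Break the bit range `[start_bit, end_bit)` (global bit indices) into at
/// most one leading sub-grain piece, one run of whole grains, and one
/// trailing sub-grain piece, handing each piece to a single `FnMut` callback.
fn break_range_callback<F: FnMut(VisitRange)>(start_bit: usize, end_bit: usize, mut visit: F) {
    if start_bit >= end_bit {
        return; // empty range
    }
    let start_grain = start_bit / GRAIN_BITS;
    let end_grain = end_bit / GRAIN_BITS;
    if start_grain == end_grain {
        // The whole range lies inside a single grain.
        visit(VisitRange::SubGrain {
            grain: start_grain,
            bit_start: start_bit % GRAIN_BITS,
            bit_end: end_bit % GRAIN_BITS,
        });
        return;
    }
    if start_bit % GRAIN_BITS != 0 {
        // Leading partial grain.
        visit(VisitRange::SubGrain {
            grain: start_grain,
            bit_start: start_bit % GRAIN_BITS,
            bit_end: GRAIN_BITS,
        });
    }
    let first_whole = (start_bit + GRAIN_BITS - 1) / GRAIN_BITS;
    if first_whole < end_grain {
        // Grain-aligned middle: can be visited one whole grain (or one memset) at a time.
        visit(VisitRange::WholeGrains { start: first_whole, end: end_grain });
    }
    if end_bit % GRAIN_BITS != 0 {
        // Trailing partial grain.
        visit(VisitRange::SubGrain {
            grain: end_grain,
            bit_start: 0,
            bit_end: end_bit % GRAIN_BITS,
        });
    }
}
```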

(TODO) Make a generic mechanism to break a contiguous data address range into one or more contiguous metadata ranges, so that the concrete metadata bit traversal mechanism can be expressed as several levels of nested for loops (or nested callbacks).

@k-sareen (Collaborator)

It might interest you to see how Android implements this: https://cs.android.com/android/platform/superproject/main/+/main:art/runtime/gc/accounting/space_bitmap-inl.h;l=103-229?q=visitMarkedRange

They have two functions: VisitMarkedRange, for an arbitrary begin and end, and Walk, for going through the entire bitmap (which can be optimized to just load whole words, skipping all the extra complexity of handling the left and right edges).
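For context, a whole-bitmap walk of that kind (load one word at a time, skip zero words, then peel off the set bits) can be sketched roughly as follows; the function and parameter names here are purely illustrative, not ART's or mmtk-core's actual code:

```rust
/// Walk every set bit of a mark bitmap by loading whole words.
/// `base` is the data address that bit 0 of `bitmap` corresponds to, and
/// `bytes_per_bit` is the amount of data each metadata bit covers.
fn walk_bitmap<F: FnMut(usize)>(bitmap: &[usize], base: usize, bytes_per_bit: usize, mut visit: F) {
    for (word_idx, &word) in bitmap.iter().enumerate() {
        let mut w = word;
        // Empty words are skipped with a single comparison; only non-zero
        // words pay for the per-bit loop below.
        while w != 0 {
            let bit = w.trailing_zeros() as usize;
            let global_bit = word_idx * usize::BITS as usize + bit;
            visit(base + global_bit * bytes_per_bit);
            w &= w - 1; // clear the lowest set bit
        }
    }
}
```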

@wks wks (Collaborator, Author) commented Jul 29, 2024

Here are some benchmark results for bulk zeroing/setting one-bit-per-word metadata (the VO bit) for a 256-byte region (an Immix line). They can be obtained by executing MMTK_BENCH=bzero perf record cargo bench after checking out this branch.

implementation | grain | callback | normalize | fast path | result (ns)
---------------|-------|----------|-----------|-----------|------------
modern         | 64    | false    | false     | false     | 49.505
modern         | 64    | true     | false     | false     | 12.639
modern         | 64    | true     | true      | false     | 14.602
modern         | 64    | true     | true      | true      | 15.868
modern         | 32    | false    | false     | false     | 39.938
modern         | 32    | true     | false     | false     | 6.2234
modern         | 32    | true     | true      | false     | 11.217
modern         | 32    | true     | true      | true      | 5.4676
modern         | 8     | false    | false     | false     | 41.254
modern         | 8     | true     | false     | false     | 6.4790
modern         | 8     | true     | true      | false     | 11.243
modern         | 8     | true     | true      | true      | 4.6617
classic        | 8     | true     | /         | false     | 3.1614
classic        | 8     | true     | /         | true      | 3.3967
  • "classic" means the old iterate_meta_bits function, and "modern" implementations use the new grain::break_range and grain::break_range_callback functions introduced in this PR.
  • "grain" is the granularity, i.e whether we read/write 8/16/32/64 bits at a time.
  • "callback" is true means using grain::break_range_callback. It visits bit/byte ranges using a callback impl FnMut(VisitRange). The other implementation is grain::break_range which returns a Vec<VisitRange>.
  • "normalize" means normalizing the input BitAddress arguments so that the byte address is aligned to the grain size. It is incorrect in general if we don't normalize, unless the input is already normalized. But it shows the cost of normalizing the BitAddress.
  • "fast path" implementations tests in the beginning of the function whether the input bit addresses are already grain-aligned.

The results show that

  • Allocating a Vec for every 256 bytes of data address is definitely overkill and must be avoided. Callback-based implementations avoid that allocation, but need to handle "stopping iteration" internally, which increases complexity.
  • The cost of normalizing is surprisingly high. It may stand out because the actual metadata being set/cleared is only a 32-bit-aligned region.
  • The fast path is effective for the "modern" implementations, but makes the "classic" implementation slightly slower. Maybe that's because the "classic" implementation already checks whether the range covers whole bytes as its second check (right after checking for empty ranges), or maybe the slowdown is an artifact of compiler optimizations.
  • Because only 32 bits of metadata are involved, accessing them at 64-bit granularity is slower: the range covers only part of a grain, so it needs an atomic read-modify-write operation.
  • In the "modern" implementation, accessing at 32-bit granularity has no advantage over accessing at 8-bit granularity. Maybe that's because the underlying implementation is always memset.

From the results, the bottleneck of setting/zeroing a small range of metadata is the preparation work, such as figuring out the bit ranges and byte ranges and ensuring the range is grain-aligned. The actual bit-zeroing and bit-setting is very cheap. Also, because we often bulk-set and bulk-clear aligned ranges (such as Immix lines), it is important that such use cases are covered by fast paths; otherwise the redundant checks will dominate.

Normalizing is necessary because the input is a byte address and an in-byte bit offset obtained from address_to_meta_address and meta_byte_lshift. It may be beneficial to refactor address_to_meta_address and meta_byte_lshift so that they return grain-aligned byte addresses and in-grain offsets instead.
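For illustration, normalizing a (byte address, in-byte bit offset) pair to a grain-aligned byte address plus an in-grain bit offset might look like the following; BitAddress here is a simplified stand-in for the PR's type:

```rust
/// A bit position expressed as a byte address plus an in-byte bit offset,
/// which is what `address_to_meta_address` and `meta_byte_lshift` give us.
#[derive(Clone, Copy)]
struct BitAddress {
    byte_addr: usize,
    bit_offset: usize, // 0..8
}

/// Normalize so that `byte_addr` is aligned to the grain size (in bytes,
/// assumed to be a power of two) and the offset is counted in bits within
/// that grain. This is the extra work the "normalize" column above pays for.
fn normalize(addr: BitAddress, grain_bytes: usize) -> BitAddress {
    let aligned = addr.byte_addr & !(grain_bytes - 1);
    let extra_bytes = addr.byte_addr - aligned;
    BitAddress {
        byte_addr: aligned,
        bit_offset: extra_bytes * 8 + addr.bit_offset, // now 0..(grain_bytes * 8)
    }
}
```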

@qinsoon qinsoon (Member) commented Jul 31, 2024

If you would like to push further with this PR, you would need to show that the refactoring results in an improvement in favorable cases and does not cause slowdowns in other cases. Otherwise it is not worth introducing the complexity.

@wks wks (Collaborator, Author) commented Aug 6, 2024

Since this is too complicated and doesn't perform well, I am closing this PR in favor of #1181. I'll continue refactoring as needed in the future, bit by bit.

@wks wks closed this Aug 6, 2024