
Build Vecs without repeated re-allocation and copies #35

Open
overlookmotel opened this issue Jun 6, 2024 · 2 comments


overlookmotel commented Jun 6, 2024

The problem with Vecs

Bumpalo packs data tightly and grows downwards. This is generally a good thing, but when building Vecs it's a disadvantage: when a Vec needs to grow beyond its capacity, it is always reallocated and its contents copied.

For example, while parsing a program, the top-level statements Vec will initially have capacity 4 (I think that's the default). When a 5th statement is found:

  • Vec<Statement> grows to capacity 8.
  • Space for [Statement; 8] is allocated in the arena.
  • The first 4 elements are copied to the new location.
  • The 5th element is written after them.

When more than 8 statements are found, the process repeats.

This causes 3 different problems:

  1. Repeated data copying (slow).
  2. It leaves unused space behind in the arena, making data less tightly packed and less cache-friendly.
  3. Where the Vec's backing storage ends up is unpredictable - it could be before the node data, after it, or in the middle.

This problem is most acute for the top-level Vec<Statement>, as a program can contain a great many statements.
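A minimal sketch of this growth behaviour, using bumpalo's collections::Vec directly (requires bumpalo's collections feature; illustrative only):

```rust
use bumpalo::{collections::Vec as BumpVec, Bump};

fn main() {
    let bump = Bump::new();
    // Start at capacity 4, as described above.
    let mut stmts = BumpVec::with_capacity_in(4, &bump);
    for i in 0..4u64 {
        stmts.push(i);
    }
    let ptr_before = stmts.as_ptr();
    // The 5th push exceeds capacity: the Vec is reallocated and the first
    // 4 elements are copied, leaving the old [u64; 4] space unused.
    stmts.push(4);
    assert_ne!(ptr_before, stmts.as_ptr());
}
```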

Possible solutions

1. Estimate number of statements from source text size

For top-level Vec<Statement>, make a guess at eventual size based on source text size and allocate a Vec with (hopefully) sufficient capacity.

If we guess correctly, this will reduce copying a lot. However, it's hard to guess accurately. A module which is entirely wrapped in an IIFE could be massive in terms of source text size, but contains only a single top-level statement. Webpack-bundled files will also have a small number of top-level statements.

Over-allocation is probably not so bad. It will leave large chunks of memory unused in the arena, but that empty memory is never accessed, so shouldn't affect caching. If it's a large stretch of memory, the OS may not even allocate memory pages for it.

It's also hard to guess the number of statements in functions ahead of time. We could probably make good guesses at the size of Vecs for other things, e.g. most functions don't have more than 4 arguments.
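A hypothetical form such a heuristic could take (the bytes-per-statement divisor is an invented placeholder, not a measured figure):

```rust
/// Guess top-level statement count from source text length.
fn estimated_statement_capacity(source_len: usize) -> usize {
    // Assumed average of 64 bytes of source per top-level statement;
    // a real value would come from measuring typical code.
    const ASSUMED_BYTES_PER_STATEMENT: usize = 64;
    // Clamp so tiny files still get the default capacity of 4.
    (source_len / ASSUMED_BYTES_PER_STATEMENT).max(4)
}
```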

2. Small vec optimization

Use a "small vec" type for Vecs which often have only 1 member. There's a few good candidates for this:

  • Vec<Argument> in CallExpression (foo(bar))
  • Vec<Directive> in FunctionBody (function foo() { 'use strict'; })
  • Vec<AssignmentTargetProperty> in ObjectAssignmentTarget (const {foo} = bar;)
  • Vec<ImportDeclarationSpecifier> in ImportDeclaration (import foo from 'bar';, import {foo} from 'bar';)
  • Vec<Expression> in ImportExpression (import('./foo.js'))

If we could reduce the size of enums to 8 bytes with pointer compression/pointer tagging, we could fit 2 elements inline, which would make it suitable for a few more types. But regardless, many types wouldn't be suitable, and it still doesn't help with the biggest problem, Vec<Statement>.
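A minimal sketch of the idea, heap-backed for brevity (a real version would allocate in the arena and use a niche-packed layout):

```rust
// One element stored inline; spills to a normal Vec on the 2nd push.
enum SmallVec1<T> {
    Empty,
    One(T),          // the common single-element case: no allocation
    Spilled(Vec<T>), // fell back to ordinary Vec storage
}

impl<T> SmallVec1<T> {
    fn push(&mut self, value: T) {
        match std::mem::replace(self, SmallVec1::Empty) {
            SmallVec1::Empty => *self = SmallVec1::One(value),
            SmallVec1::One(first) => *self = SmallVec1::Spilled(vec![first, value]),
            SmallVec1::Spilled(mut v) => {
                v.push(value);
                *self = SmallVec1::Spilled(v);
            }
        }
    }
}
```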

3. Vec<Box<T>>

For Vecs where the content is not already an enum with boxed variants: box the Vec's elements. So when the Vec grows, you're only copying Box<T>s rather than Ts. In some cases T is quite large, e.g. FormalParameter is 80 bytes, whereas Box<FormalParameter> is 8 bytes.

Trade-off of read speed vs write speed:

  • Faster to push/insert/delete from Vecs.
  • Slower to iterate over their contents due to greater indirection.
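A sketch of the size argument, with an illustrative stand-in type sized to match the figures quoted above (on a 64-bit target):

```rust
// Stand-in for oxc's 80-byte FormalParameter.
struct FakeFormalParameter {
    _data: [u8; 80],
}

fn main() {
    // Growing a Vec<Box<T>> copies 8-byte pointers, not 80-byte values.
    assert_eq!(std::mem::size_of::<FakeFormalParameter>(), 80);
    assert_eq!(std::mem::size_of::<Box<FakeFormalParameter>>(), 8);
}
```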

4. Replace Vec with chunked linked list

Chunked linked list

As the list grows, chunks would be added to the end of the linked list in sizes of increasing powers of 2 (the same growth strategy as Vec). This keeps the memory overhead of forward pointers low compared to a standard linked list.
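A minimal sketch of the structure, heap-backed with Box for brevity (the real thing would allocate chunks in the arena):

```rust
struct Chunk<T> {
    items: Vec<T>,               // this chunk's slots; never reallocated
    next: Option<Box<Chunk<T>>>, // forward link to the next, larger chunk
}

struct ChunkedList<T> {
    head: Option<Box<Chunk<T>>>,
}

impl<T> ChunkedList<T> {
    fn push(&mut self, value: T) {
        let mut slot = &mut self.head;
        let mut next_capacity = 4; // first chunk's capacity, doubling after
        while let Some(chunk) = slot {
            if chunk.items.len() < chunk.items.capacity() {
                // Room in the tail chunk: no existing element ever moves.
                chunk.items.push(value);
                return;
            }
            next_capacity = chunk.items.capacity() * 2;
            slot = &mut chunk.next;
        }
        // List empty, or tail chunk full: link in a new chunk instead of
        // reallocating and copying as Vec would.
        let mut items = Vec::with_capacity(next_capacity);
        items.push(value);
        *slot = Some(Box::new(Chunk { items, next: None }));
    }
}
```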

Advantages:

  • No memory copies at all involved in building a Vec = faster in parser.
  • No wasted space/gaps in arena.
  • Better cache locality than Vec for large lists of enum types, as the memory for a list chunk is close to the memory containing the Box<T> content.
  • Faster to insert/delete elements, as you only need to shuffle entries up/down within one chunk, rather than across the whole list (see also transformer: Sync AST nodes and scopes oxc#3503).

Disadvantages:

  • Iterating over a Vec becomes slightly more complicated, so would have a small cost. But I expect that cost to be minimal (1 branch which is mostly not taken).
  • You can't index into the list, only iterate over it.

Indexing: I don't think we often need to index into Vecs. If we want to provide a Path API (#30 (comment)), that would require indexing. We could design a type which does allow indexing by using index.trailing_zeros() to calculate which chunk. This type would be large, so only suitable for top-level statements.
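One way indexing could work, sketched with ilog2 (a close cousin of the trailing_zeros trick); the initial chunk capacity of 4 is an assumption:

```rust
/// Map an element index to (chunk number, offset within chunk), for chunk
/// capacities 4, 8, 16, ... Chunk k covers indices
/// [4 * (2^k - 1), 4 * (2^(k+1) - 1)).
fn locate(index: usize) -> (usize, usize) {
    const FIRST_CHUNK_LEN: usize = 4; // assumed initial capacity
    let chunk = ((index / FIRST_CHUNK_LEN) + 1).ilog2() as usize;
    let chunk_start = FIRST_CHUNK_LEN * ((1 << chunk) - 1);
    (chunk, index - chunk_start)
}
```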

5. Build Vecs in scratch space

Add a scratch space arena to the allocator. In the parser, build Vecs initially in the scratch space, and then copy the contents into the main arena once the Vec is complete.

In the parser, you're only building one Vec at a time, so the scratch space can be treated as a stack, growing upwards.

Advantages:

  • We end up with a standard Vec type.
  • No wasted space/gaps in arena.
  • No problem indexing into Vec.
  • Fewer memory copies than the current method.

Disadvantages vs chunked linked list:

  • More memory copies than chunked linked list.
  • Worse cache locality.
  • Special API required for using vec builder.

Number of copies: Currently, building a Vec with between 129 and 255 elements results in 252 elements being copied in total (4 + 8 + 16 + 32 + 64 + 128). The scratch space method copies each element only once (129 to 255 copied elements) - so reduced copying in almost all cases. The copy also happens in a single pass, which may be faster (the compiler can use SIMD ops and loop unrolling).
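A rough sketch of the builder, with a plain Vec standing in for the scratch arena and bumpalo's alloc_slice_copy performing the final single copy:

```rust
use bumpalo::Bump;

// Build a slice in scratch space, then copy into the main arena once.
// A std Vec stands in for the scratch arena; a real implementation would
// use a second bump arena treated as a stack.
fn build_via_scratch<'main, T: Copy>(
    main: &'main Bump,
    scratch: &mut Vec<T>,
    items: impl IntoIterator<Item = T>,
) -> &'main [T] {
    scratch.clear();
    scratch.extend(items); // all growth happens in scratch space
    // One copy, in a single pass, now that the final length is known.
    main.alloc_slice_copy(scratch)
}
```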

Conclusion

All of the above could be used in combination.

If we don't need indexing, the chunked linked list is probably the best solution. It would need benchmarking, though.


overlookmotel commented Jun 8, 2024

I realised a chunked linked list could also allow a more compact data representation for Vec<Statement>.

Currently, 4 statements are stored as follows:

| Bytes | Content |
| ----- | ------- |
| 0     | Enum discriminant 1 |
| 1-7   | Unused |
| 8-15  | Box<T> 1 |
| 16    | Enum discriminant 2 |
| 17-23 | Unused |
| 24-31 | Box<T> 2 |
| 32    | Enum discriminant 3 |
| 33-39 | Unused |
| 40-47 | Box<T> 3 |
| 48    | Enum discriminant 4 |
| 49-55 | Unused |
| 56-63 | Box<T> 4 |

Could be stored in chunks as:

| Bytes | Content |
| ----- | ------- |
| 0     | Enum discriminant 1 |
| 1     | Enum discriminant 2 |
| 2     | Enum discriminant 3 |
| 3     | Enum discriminant 4 |
| 4-7   | Unused |
| 8-15  | Box<T> 1 |
| 16-23 | Box<T> 2 |
| 24-31 | Box<T> 3 |
| 32-39 | Box<T> 4 |

4 elements = 40 bytes, instead of 64 bytes (10 bytes per element).
8 elements = 72 bytes, instead of 128 bytes (9 bytes per element).

To be fair, chunking isn't required for this optimization. We could also pack a contiguous list using the same technique.
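The packed 4-element layout above, sketched as a struct with illustrative names (on a 64-bit target):

```rust
// Discriminants and pointers in separate arrays, rather than each element
// being a 16-byte (discriminant + padding + pointer) enum.
#[repr(C)]
struct PackedStatements4 {
    discriminants: [u8; 4],   // bytes 0-3: enum discriminants 1-4
    _unused: [u8; 4],         // bytes 4-7: padding for pointer alignment
    pointers: [*const u8; 4], // bytes 8-39: the four Box<T> pointers
}

// 40 bytes instead of 64, matching the table above.
const _: () = assert!(std::mem::size_of::<PackedStatements4>() == 40);
```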


overlookmotel commented Jul 18, 2024

Using a chunked linked list, I think it could be possible to make Traverse visitors able to add statements to a parent/ancestor Vec<Statement> from within a child.

The key would be to ensure that the node which is currently on the visitation "branch" does not move. You may shuffle items up/down before or after the node on the current visitation branch, or insert/delete chunks before/after the current chunk, but the current node stays put, so its pointer remains valid throughout visitation.

This would require:

  • Each chunk contains both a forward link to the next chunk and a back link to the previous one (a doubly-linked list).
  • Probably a minimum chunk size of 8, or the overhead of the 2 back/forward pointers starts to outweigh the memory taken by the data itself.
  • An "empty" sentinel value for slots.
  • Ideally, enums stored in such a vec have at most 128 variants, so the top bit of the discriminant can be used for "empty".

This would work particularly well with chunks of constant size, e.g. 8, 16 or 32 slots. Using the scheme described in the previous comment of splitting enum discriminants and pointers apart, you could find an empty slot with a single SIMD instruction - the same way Swiss Table / hashbrown does.
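The same search sketched with a SWAR bit-trick instead of explicit SIMD intrinsics (same idea: test all 8 discriminant bytes at once; assumes the top-bit "empty" convention above):

```rust
/// Find the first empty slot in an 8-slot chunk, given its discriminant
/// bytes. A set top bit marks an empty slot (real discriminants are < 128).
fn first_empty_slot(discriminants: [u8; 8]) -> Option<usize> {
    let word = u64::from_le_bytes(discriminants);
    // Keep only the top bit of each byte; a set bit means "empty".
    let empties = word & 0x8080_8080_8080_8080;
    if empties == 0 {
        None // chunk is full
    } else {
        // Byte index of the lowest set bit = first empty slot.
        Some((empties.trailing_zeros() / 8) as usize)
    }
}
```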

Layout of an 8-item chunk of Statements:

  • Bytes 0-7: Enum discriminants for each slot (8 bytes).
  • Bytes 8-71: Pointers for each slot (64 bytes).
  • Bytes 72-79: Pointer to previous chunk.
  • Bytes 80-87: Pointer to next chunk.

88 bytes = 11 bytes per element.

16-item chunk = 160 bytes = 10 bytes per element.

Or aligned to cache line sizes:

5-item chunk in 64 bytes:

  • Bytes 0-4: Enum discriminants for each slot (5 bytes).
  • Bytes 5-7: Unused (3 bytes).
  • Bytes 8-47: Pointers for each slot (40 bytes).
  • Bytes 48-55: Pointer to previous chunk.
  • Bytes 56-63: Pointer to next chunk.

64 bytes = 12.8 bytes per item.

12-item chunk in 128 bytes:

  • Bytes 0-11: Enum discriminants for each slot (12 bytes).
  • Bytes 12-15: Unused (4 bytes).
  • Bytes 16-111: Pointers for each slot (96 bytes).
  • Bytes 112-119: Pointer to previous chunk.
  • Bytes 120-127: Pointer to next chunk.

128 bytes = 10.7 bytes per item.

The latter seems a pretty good fit for the top-level Vec<Statement> - 1.7 bytes of overhead per item (when full to capacity).
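The 12-slot variant sketched as a struct (names illustrative; *const u8 stands in for the boxed statement pointers):

```rust
#[repr(C, align(128))]
struct StatementChunk {
    discriminants: [u8; 12],   // bytes 0-11: one discriminant byte per slot
    _unused: [u8; 4],          // bytes 12-15: unused
    pointers: [*const u8; 12], // bytes 16-111: payload pointer per slot
    prev: *mut StatementChunk, // bytes 112-119: back link
    next: *mut StatementChunk, // bytes 120-127: forward link
}

// Exactly two 64-byte cache lines on a 64-bit target.
const _: () = assert!(std::mem::size_of::<StatementChunk>() == 128);
```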
