
Per-lane loads and stores #9

Open
penzn opened this issue May 27, 2020 · 5 comments

@penzn
Contributor

penzn commented May 27, 2020

Trying to establish a new home for the discussion on masks and their alternatives. Let me know if the title does not make sense - changing it would be easy.

Generally there are a few ways to handle loads and stores that require less than a full register:

  1. Only allow full register loads and stores
  2. Use masks to choose which lanes are going to be affected or selected
  3. Allow setting the length per operation

Padding (#8) can be seen either as an enhancement to these or as somewhat orthogonal - when data is padded, the use of some of the partial-vector operations can be avoided.

The bare variant of (1) simply disables loads and stores operating on less than a hardware register. Remainders are to be handled in a scalar loop.

Masks (2) are the same approach as used in AVX and SVE. Since different instruction sets represent masks differently, they need to come with their own types and instructions. For a prior discussion on masks see: #6 (comment) and onward.

The set-length approach (3) allows excluding the tail of the vector - it is possible to exclude a lane, but only together with all the lanes that come after it. The upside is that the representation is simpler than masks and works with both masking and non-masking ISAs; the downside is that it introduces internal state.
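To make the difference between (2) and (3) concrete, here is a minimal scalar model of the two store flavors (illustrative C, not a proposed encoding):

#include <stdbool.h>
#include <stdint.h>

enum { VLEN = 4 }; /* illustrative lane count */

/* Option (2): a mask can select arbitrary lanes. */
static void store_masked(int32_t *dst, const int32_t v[VLEN],
                         const bool mask[VLEN]) {
    for (int i = 0; i < VLEN; i++)
        if (mask[i]) dst[i] = v[i];
}

/* Option (3): a length can only exclude the tail -- lane i is stored
   only together with every lane before it. */
static void store_set_length(int32_t *dst, const int32_t v[VLEN], int len) {
    for (int i = 0; i < len; i++)
        dst[i] = v[i];
}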

@jan-wassenberg

Thanks for branching this discussion. Would it make sense to list a fourth option, that static types (e.g. vec64) implicitly define the size to load?

Continuation of the discussion with Florian. I can see you are passionate about this topic, which is great :)

And in this case, the only faster alternative I can think of is actually padding the data.
This is a good solution, but not a universal one, and it cannot be done by compilers, as they are not allowed to do it on their own.

This is interesting, it seems you would like autovectorization to work? From my perspective, it still isn't dependable/robust - and it's been 16 years since that went into GCC. As far as I'm concerned, it's far more reliable to use intrinsics or preferably a wrapper on top of them.

This is an example of an algorithm that seems impossible to do in parallel, and yet it is possible.

:) It was a moderately complex predictor, so the only parallelization we could see was doing it for two independent color channels at a time.
I agree this is rare, and relatively weak vectorization.

Also, I don't know how set_length would help you there as you would still need to use the length-agnostic ISA.

To be clear, I am worried about set_length for a similar reason - it's not performance-portable to older platforms.

The mechanism I had in mind is providing smaller types, vec64, vec32, and for completeness also 16/8. There's specialized codepaths for loading/storing them, but otherwise they just use the vec128 type for computations (to reduce code size). This would be in addition to the type that's the widest available.

I believe this is more efficient and safer than using masked load/store: the type already indicates how much to read, so we can just use the 64-bit load instruction directly without decoding a mask. Also, if we change our minds (oh, can only do a single element), we don't need to update all the call sites of the load/store to make sure the masks are updated.
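A minimal sketch of this mechanism on SSE2 (vec64_load / vec64_store are illustrative names, not a settled API): the smaller type gets its own load/store codepath, while computation reuses the full 128-bit register.

#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* "vec64" load: reads exactly 8 bytes and zeroes the upper half of
   the 128-bit register, so no mask has to be decoded. */
static inline __m128i vec64_load(const int32_t *p) {
    return _mm_loadl_epi64((const __m128i *)p);
}

/* "vec64" store: writes exactly the low 8 bytes. */
static inline void vec64_store(int32_t *p, __m128i v) {
    _mm_storel_epi64((__m128i *)p, v);
}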

But I think StoreOnlyN or maskedStore can be efficient enough on most architectures that it would not be a problem in practice.

Most differences in opinion are grounded in differing experiences. I tried really hard to make masked loads efficient for HighwayHash (where users give unpadded buffers and the hash function has to respect that), and was rather disappointed.
uops.info reports a masked load is 11-14 cycles! And we still need a scalar loop because the masking is only at 32-bit granularity. For comparison: block ciphers are fundamentally tied to their block size, applications just have to pad the input data, and I haven't seen anyone feel constrained by that?

And only code that aims to be super fast (and not merely fast) would need to pad.

This can be interesting to discuss further. I imagine that most users will already have used native code, and we've already established that they care about speed (why use SIMD otherwise). Then the question is: how much slowdown would they accept before giving up and writing blog posts about how native apps are preferable? IMO something like a 10-20% penalty is acceptable, how's your thinking on this?

My point here is: if you process elements one by one, you've lost the speed battle, whatever the actual operations you apply to them.

Agreed. But in the use cases I've seen, the remainders are by definition negligible when we're processing entire scanlines or audio sample buffers or whatever.

The only case I've seen where 'remainders' are important is strlen, and I would modestly claim that null-terminated strings are a rather poor choice anyway (Pascal strings, std::string, even BSTR have fixed this for a long time).

I would say any code working on packed structures in a concurrent context (for the store problem).

Oh yes, concurrency is harder. But aren't those apps already in trouble if the vectors are of unknown length? In other words: if they can't write a full vector (however long it is), then it's a "remainder" case anyway.

@lemaitre

First, I would like to highlight the fact that mask support is pretty much required anyway, because AVX512 and SVE comparison instructions return an actual mask, and not a "mask-like" vector as on previous architectures.

It would still be possible to use masks only for the ternary operator (?:, blendv, select, ifthenelse, whatever you call it) and leave all other operations unmasked, especially memory accesses.
Needless to say, I don't like this approach, but it is definitely possible.

Padding (#8) can be seen either as an enhancement to these or as somewhat orthogonal - when data is padded, the use of some of the partial-vector operations can be avoided.

Padding should be the recommended way for high-performance applications even if we have support for masked memory accesses or set_length.
In fact, that is already a sane recommendation when targeting native SIMD directly (no WASM).
So we should just provide a facility for that.
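For illustration, such a facility could be as small as a padded allocator - a sketch, assuming a maximum vector width of VLEN_BYTES:

#include <stdlib.h>
#include <string.h>

#define VLEN_BYTES 64 /* assumed maximum vector width in bytes */

/* Round the allocation up to a whole number of vectors so that
   full-width loads/stores never run past the buffer. */
static void *alloc_padded(size_t n_bytes) {
    size_t padded = (n_bytes + VLEN_BYTES - 1) / VLEN_BYTES * VLEN_BYTES;
    void *p = aligned_alloc(VLEN_BYTES, padded);
    if (p) memset((char *)p + n_bytes, 0, padded - n_bytes); /* neutral tail */
    return p;
}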

Masks (2) are the same approach as used in AVX and SVE. Since different instruction sets represent masks differently, they need to come with their own types and instructions. For a prior discussion on masks see: #6 (comment) and onward.

Masking is not a new concept and has been partially supported for a long time (even without native mask types).
Plus, the number of actual masking operations can be drastically reduced if we mask only at the ends of branches (phi nodes in the SSA representation) and at stores.
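As a sketch of what "mask only at the stores" looks like in practice (AVX2, 32-bit lanes; assumes the input is padded so the unmasked load cannot fault):

#include <immintrin.h>
#include <stdint.h>

/* Arithmetic in the loop body runs unmasked; the mask is applied only
   where results become observable, i.e. at the store. */
static void square_plus_input(const int32_t *a, int32_t *b, int n) {
    const __m256i iota = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    for (int i = 0; i < n; i += 8) {
        __m256i lane = _mm256_add_epi32(_mm256_set1_epi32(i), iota);
        __m256i m    = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane); /* i+j < n */
        __m256i x    = _mm256_loadu_si256((const __m256i *)(a + i));   /* unmasked */
        __m256i y    = _mm256_add_epi32(_mm256_mullo_epi32(x, x), x);  /* unmasked: x*x + x */
        _mm256_maskstore_epi32((int *)(b + i), m, y);  /* mask applied once, here */
    }
}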

A masked select is trivially and efficiently implementable in all ISAs, so the only critical part of masking is memory accesses.
But masked loads/stores have the same problems as the hypothetical loadOnlyN/storeOnlyN that would be used for set_length.

Here is a summary of these problems (phrased in terms of masked accesses, but they apply to partial accesses as well):

  • Aligned masked load: nothing fancy here - just do a full-width regular load and apply the mask to the resulting vector (see the sketch after this list).
  • Unaligned masked load: here the problem is that if a page boundary is crossed, the load can fault. But what if the faulting elements are all inactive? If such a load is supported, the fault should be ignored when it appears only on inactive elements.
  • Aligned masked store: you really cannot store the inactive elements, because this could lead to a race condition if the inactive elements are concurrently stored by someone else.
  • Unaligned masked store: the same problem as the aligned one, plus faulting.
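The first bullet as code (a sketch assuming AVX2 and a mask whose lanes are all-ones or all-zeros):

#include <immintrin.h>
#include <stdint.h>

/* A 32-byte-aligned load cannot cross a page boundary, so it faults
   only if the whole vector is inaccessible. Load full width, then
   clear the inactive lanes. */
static inline __m256i masked_load_aligned(const int32_t *p, __m256i mask) {
    __m256i full = _mm256_load_si256((const __m256i *)p);
    return _mm256_and_si256(full, mask);
}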

Actually, masked stores are not problematic on x86, as they have existed since SSE2, and even though they don't exist for 8-bit lanes on AVX2, it is still possible to split the vector in two and use the 128-bit masked store.
Masked stores on Neon are less efficient and would require "scalar" emulation with a bunch of st1q_lane.
In this case, storeOnlyN might be more efficient than an actual masked store, as you could implement it with a single computed goto instead of multiple ifs.
Unaligned masked stores are the same.
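For example, storeOnlyN for 32-bit lanes could be sketched in portable C like this - the switch with fallthrough compiles down to a single indirect jump, which is the computed-goto idea (names and lane count are illustrative):

#include <stdint.h>

/* Store only the first n lanes of a 4-lane vector of int32. */
static inline void store_only_n_i32(int32_t *dst, const int32_t v[4], int n) {
    switch (n) {               /* one indirect jump instead of four ifs */
    case 4: dst[3] = v[3];     /* fall through */
    case 3: dst[2] = v[2];     /* fall through */
    case 2: dst[1] = v[1];     /* fall through */
    case 1: dst[0] = v[0];     /* fall through */
    case 0: break;
    }
}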

Unaligned masked loads are a different beast. There is no support for masked loads in SSE, and only partial support (int32 and int64) in AVX.
There is still no support in Neon.
So you could just do "scalar" emulation and call it a day (like masked stores on Neon), and it would work, but not as efficiently as it could.
Or you can rely on signal handlers to detect the fault, check whether the load was masked, and choose to ignore the fault if it appears only on inactive elements.

The signal-handler solution could work in theory, but it would need a complete list of masked loads and a way to retrieve the actual mask. The benefit of this solution is that you pay for it only when you're on the verge of faulting; it is otherwise free.
The main problem is that it uses signal handlers.


Would it make sense to list a fourth option, that static types (e.g. vec64) implicitly define the size to load?

I would probably prefer to have a masked gather for this.

Continuation of the discussion with Florian. I can see you are passionate about this topic, which is great :)

SIMD is my job, and I have had time to think about these things on my own. So why not share my thoughts and try to make the world a better place? ;)

This is interesting, it seems you would like autovectorization to work? From my perspective, it still isn't dependable/robust - and it's been 16 years since that went into GCC. As far as I'm concerned, it's far more reliable to use intrinsics or preferably a wrapper on top of them.

Yes, I would like to see autovectorization work (which is not a WASM concern, as it happens before WASM).

Making autovectorization work does not exclude the possibility of using intrinsics, though.
But the thing is: if you use intrinsics, the compiler is still not allowed to add padding wherever it wants/needs, so in that case the problem remains.

The mechanism I had in mind is providing smaller types, vec64, vec32, and for completeness also 16/8. There's specialized codepaths for loading/storing them, but otherwise they just use the vec128 type for computations (to reduce code size). This would be in addition to the type that's the widest available.

Well, I have the impression that this is not really about flexible vectors, then. It would be a refinement of fixed-size SIMD for smaller types.

I believe this is more efficient and safer than using masked load/store: the type already indicates how much to read, so we can just use the 64-bit load instruction directly without decoding a mask.

It is more efficient only if the actual length is known at the time the code is generated. If it's not, then it would not be more efficient than loadOnlyN/storeOnlyN.
Also, if during code generation the WASM virtual machine detects that the mask has a certain shape, it can generate a loadOnlyN/storeOnlyN instead of a full masked memory access.
And if N is known, then it can generate the optimal non-masked load.

I would much prefer to rely on metadata and constant folding to optimize masked memory accesses.

Also, if we change our minds (oh, can only do a single element), we don't need to update all the call sites of the load/store to make sure the masks are updated.

This is a valid concern for source code, but not so much for "assembly".
And this is easily achievable in plain C++ with templates and overloading (and even in C, to some extent).
So I would not care much about having a single modification point at the WASM level.

uops.info reports a masked load is 11-14 cycles!

That's surprising, as that's not what the Intel intrinsics guide says: it is more like 8 cycles of latency and, more importantly, you can do two per cycle.
But if you know for sure the data is aligned (or that you will not fault), you can just mask with a blendv after the load.
This can be achieved by checking the misalignment to see whether the load can cross a page boundary.
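A sketch of that check (assumptions: 4 KiB pages, AVX2, a mask whose lanes are all-ones or all-zeros; load_masked_i32 is an illustrative name):

#include <immintrin.h>
#include <stdint.h>

static inline __m256i load_masked_i32(const int32_t *p, __m256i mask) {
    uintptr_t addr = (uintptr_t)p;
    /* The 32-byte load crosses a page only if it starts in the last
       31 bytes of a 4 KiB page. */
    if ((addr & 4095) <= 4096 - 32) {
        __m256i full = _mm256_loadu_si256((const __m256i *)p);
        return _mm256_and_si256(full, mask);        /* clear inactive lanes */
    }
    return _mm256_maskload_epi32((const int *)p, mask); /* slow path near a boundary */
}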

how much slowdown would they accept before giving up and writing blog posts about how native apps are preferable? IMO something like a 10-20% penalty is acceptable, how's your thinking on this?

Yes, I think 10-20% is acceptable, but I also think that if the documentation of masked load/store specifically says padding should be preferred when suitable, people will try that before complaining about speed.

The only case I've seen where 'remainders' are important is strlen, and I would modestly claim that null-terminated strings are a rather poor choice anyway (Pascal strings, std::string, even BSTR have fixed this for a long time).

Yes, but those are used to interface with the kernel, so they will probably never disappear.

Oh yes, concurrency is harder. But aren't those apps already in trouble if the vectors are of unknown length? In other words: if they can't write a full vector (however long it is), then it's a "remainder" case anyway.

Consider a multidimensional array where the last dimension is relatively small and the other dimensions are much larger, but they are not suited to be the last dimension because of the way the processing is done.
Maybe you can't afford to pad because you're already tight on memory.
Then you can still write full vectors for the beginning of each line, but a full-width store at the end of the line might touch the beginning of the next line, which is being processed by another thread.

In this case, you can easily rely on full-width accesses if you can stop before the last portion:

int i;
for (i = 0; i + vlen <= w; i += vlen) {
  process(i);                 // full width, not masked
}
maskX m = (i + laneid < w);   // lanes past the end of the line are inactive
process(i, m);                // masked remainder

It is easy to imagine such a case where the remainder is not the majority of the processing, yet is non-negligible because it is repeated many times.

@penzn
Contributor Author

penzn commented May 29, 2020

Would it make sense to list a fourth option, that static types (e.g. vec64) implicitly define the size to load?

I am a bit worried about how to represent conversions between those and the "main" vector types, though I do see why they can be useful.

This is interesting, it seems you would like autovectorization to work? From my perspective, it still isn't dependable/robust - and it's been 16 years since that went into GCC. As far as I'm concerned, it's far more reliable to use intrinsics or preferably a wrapper on top of them.

Yes, I would like to see autovectorization work (which is not a WASM concern, as it happens before WASM).

Making autovectorization work does not exclude the possibility of using intrinsics, though.
But the thing is: if you use intrinsics, the compiler is still not allowed to add padding wherever it wants/needs, so in that case the problem remains.

That is a great subject - I would like to see autovectorization working, and it should work in at least the naive cases. The question is what compilers are capable of producing. Keep in mind that currently our only compiler backend is LLVM.

@lemaitre
Copy link

lemaitre commented Jun 3, 2020

The question is what compilers are capable of producing. Keep in mind that currently our only compiler backend is LLVM.

As far as I can see, LLVM is already quite good at vectorizing code.
But it is also not really smart about remainders for now (see this godbolt link from #7).

@jan-wassenberg

I am a bit worried about how to represent conversions between those and the "main" vector types, though I do see why they can be useful.

Yes, this would need to be defined. If using the load32_zero semantics (WebAssembly/simd#237), where the upper lanes are zero, we could safely provide zero_extend128, to64, etc. functions, right?
