Lane types #6
With "one size fits all," how do we know what instructions to lower to? We have to get the lane size from somewhere, right?
The instructions would still have the lane size, just the input and output types would be the same. For example, if the single type is called `vec`, then both `vec.f32.add` and `vec.i32.mul` would take and return `vec`.
One size fits all is more consistent with v128. But having `vec.i8` (a fine-grained breakdown by underlying type and size) has some benefits too.
There is a spectrum here: on one end we have the user source code, where we know what type the user wants and is dealing with; on the other end is the hardware implementation, where it's all bits. I am leaning towards staying closer to the user source code, because I have recently been working on debugging :) For codegen, I don't think there will be a big difference with any of these types.
I've dealt with this a lot in Cranelift. And it is a pain.
:) I will be interested to hear specifics - a high-level description of what you've done, please?
Well, one part is the display of values, but that's not the biggest pain (e.g. bytecodealliance/wasmtime#1650). Cranelift does have a type system with lane-typed vector types such as `i32x4` and `f32x4`. To try to bring it back to flexible vectors: if we added types to vectors here for the reasons you describe, should we add them to the 128-bit proposal as well?
Sorry for the confusion, I did not mean to propose storing the type with the value. What I had in mind for typed lanes is that the return type is specific to the lane type, and the instruction is also specific to the lane type.
I do think more type safety is better than less type safety, but wasn't sure whether conversions would be an issue 😄
From the point of view of separating types this definitely works, but checking the values we have loaded from the stack would mean a branch (and would also mean that the actual hardware instruction is picked when such an instruction executes, which can be an issue too).
There are two things which make backtracking the value type very hard, if not impossible: memory is untyped (we don't know what the intended type of a value in memory was), and there can be control flow affecting the value type. I don't think there has been a good solution to this yet. Debugging might benefit from 'views' into the value - the user can flip between what this value could be, trying to match the type they think they are going to see.
Ah, got it, so you're choosing between these 3 signatures? The actual value type is just
The rest of my comment builds on my misunderstanding of having different vec value types.
Yea good point on the branch, so probably we don't want
This sounds like you're agreeing that having types associated with the values would help. If we had typed loads, memory would be less of a problem. Control flow is more manageable - it might require an ad-hoc control-flow-stack validation like we already do, but since control flow is structured anyway, I think we can always figure it out. And if we had typed vectors, then this also won't be an issue, since validation will take care that the merges always have the same types (sorry, this part is a bit fuzzy - I am not super familiar with the details, but I think that's how it works).
There is something crucial for flexible vectors that has been overlooked here: masks (or lane predicates). If you don't have masks, support for AVX512 and SVE will be strongly crippled. The only way to have efficient masks both on legacy architectures and on new architectures with native masks is to have separate mask types per lane width. But we can still have the same mask type for different types as long as they have the same lane width.
I don't see much benefit in having strongly typed vectors otherwise, but I think it would not harm either. Let's give some examples for unsigned integer types (just for the sake of writing C types instead of registers).
Please note how in AVX512 the mask type is suffixed with the number of lanes, and not the width of the lanes or the width of the vector.
@ngzhian, thank you for the feedback, I updated the list in the issue description.
I think that is desirable. We can start with per-lane-type types and see if that can be efficiently supported. Hardware SIMD registers don't distinguish lane types, but I don't think adding this would cost anything. Having lane types encoded into the types of local variables would help with debugging - it wouldn't be necessary to track down operations to find out how to display the vector. @lemaitre we are proposing
@penzn Yes, I have seen that you propose this.
As you said in #7, masks can emulate it. I think the mask question is more related to this issue than to #7, so I will continue here, but we can move back if you prefer.
I think that is actually not true. Of course, it requires having a "select" for each and every masked instruction, but not all instructions actually require to be masked, even within a branch. You could even envision having multiple masking policies, one of them being "undefined" or "don't care". Such a policy might not seem useful at first glance, but it could be used to save power on mask-aware architectures, and it can also be used by the WASM virtual machine to fuse registers of separate branches (with disjoint masks) if the target architecture supports masks natively. Actually, this very policy already exists in the C bindings for SVE.
I agree. Some code wants to only read/write e.g. 2 int32, for which sub-128-bit (but power-of-two) vector types can be useful. Their load/store would only touch 8 bytes.
So that's efficient for <128 bit, but what about apps that only want to read/write 123 ints?
Thumbs up. Highway takes this approach, and masks are mainly passed to the 'ternary operator', i.e. a lane-wise select.
Memory access aside, masks are available for all types, as they can always be implemented as a bit vector of the same size on which we perform bitwise operations. In my previous comment, I also described a "don't care" masking policy that would be efficient on all architectures (even legacy ones, where it could just be ignored), allowing the complete branch to be masked while paying for it on legacy architectures only at the end of the branch, where the masking policy will not be "don't care".
Those already exist (to some extent): they are called int64, int32 and int16. Now, I would argue that either you only need 2 int32, in which case you could do kind-of fine with int64, or you actually want to process pairs of int32. For this to work, you would need some sub-SIMD operations like independent shuffles on sub-elements of 64-bit lanes, or some kind of load low/high.
Yes, but not always possible. So we still need a way to handle remainders when they do exist.
Why would you want to go scalar when you can easily stay in vector with masks for the remainder only, and have a huge gain on small loops?
I agree masks are helpful and should be provided as separate types, my only concern is a programming model that claims all hardware can do StoreOnlyN(vector, count, ptr) for all types efficiently.
Yes, that use case processed pairs at a time and I'm not sure it would work with SWAR.
I agree length-agnostic code is always better when possible. In this case, each pair depended on the previous pair.
Yes, it would be harder if we wanted to do shuffles on those pairs, but often simple independent-lane ops are enough.
I'm worried about "moral hazard" - if StoreOnlyN is provided, it will be used, and probably much more frequently than really required (for remainder handling), which would be counterproductive. The more painful remainder handling is, the likelier apps would find a way to actually pad :) Do we have some examples where apps really can't pad on principle, as opposed to implementation convenience for legacy code that didn't foresee this?
An interesting point of definition. Mathematically yes, but doesn't the vector type behave very differently? int overflow is allowed, floats have extra ops such as rsqrt/AND, etc.
It depends on your threshold for "efficient". I have no problem recommending that users use a special function to pad their data, but a fallback solution is still required. Also, just because we provide an instruction does not mean that this instruction is necessarily fast.
It sounds like a scan (or prefix-"sum"). If it's really not that, and cannot be adapted in a similar way, then you're probably out of luck and short (128-bit) SIMD, or SWP might be the answer. But I have the impression that it is very rare in practice.
I understand your concern. There are two axes: making most apps fast, and making it possible for apps to be as fast as possible.
I would say any code working on packed structures in a concurrent context (for the store problem). But to be fair, I have no concrete example.
My point here is: if you process elements one by one, you've lost the speed battle, whatever the actual operations you apply to them.
Closes WebAssembly#6
This issue originally was about whether lanes should be typed or not (which I have now reflected in the title - apologies for not doing that earlier). So far the hope is that we can make lanes typed, not just in terms of size, but also in terms of what can go in them - it would be more in line with what Wasm does and would be easier to debug. @lemaitre @jan-wassenberg let's move the discussion on masks to #9, if you don't mind.
@penzn Sorry for the off-topic. My initial point was: if you have masks (and I think we don't have the choice here), you would need at least per-lane-width types.
Now, there is an argument I just remembered for having fully typed vectors: in AVX1, there are no integer operations on 256-bit registers.
@penzn Thanks for bringing us back to the original topic. It sounded like people were viewing lane-size-and-type (`vec.i32`) favorably, and I'd also agree type safety is useful for vectors. For the masks themselves, maybe the size is enough?
With #1 open, there is an interesting detail - how to define the types. Since we want to slice operations by lane type, there are a few ways to approach the 'register' type:

Edited, thanks to @ngzhian for clarifying questions:

1. A single type, `vec` or `fvec`, which would be used by all the operations, regardless of the lane type: `vec.f32.add :: vec -> vec -> vec`, `vec.i32.mul :: vec -> vec -> vec`, `vec.load :: i32 -> vec`
2. Per-lane-size types, as in the `simd` proposal: `vec.v8`, `vec.v16`, `vec.v32`, `vec.v64` - integer and floating point operations working with the same lane size would take the same type: `vec.f32.add :: vec.v32 -> vec.v32 -> vec.v32`, `vec.i32.mul :: vec.v32 -> vec.v32 -> vec.v32`, `vec.f64.add :: vec.v64 -> vec.v64 -> vec.v64`, `vec.i64.add :: vec.v64 -> vec.v64 -> vec.v64`, `vec.v8.load :: i32 -> vec.v8`
3. Per-lane-type types: `vec.i8`, `vec.i32`, `vec.f32`, etc. - everything specific to a particular data type: `vec.f32.add :: vec.f32 -> vec.f32 -> vec.f32`, `vec.i32.mul :: vec.i32 -> vec.i32 -> vec.i32`, `vec.i16.sub :: vec.i16 -> vec.i16 -> vec.i16`, `vec.i8.load :: i32 -> vec.i8`
I am leaning towards the first solution, with the single type completely interchangeable between various operations, mainly because it is simpler and better aligns with hardware.