Types support and min_distance.py function example #512
Conversation
tc/core/tc2halide.cc
Outdated
      return Float(16, 4);
    case lang::TK_VECTOR4_FLOAT32:
      return Float(32, 4);
    case lang::TK_VECTOR4_FLOAT64:
Are these used in this PR? Explicit vector types in the front-end seem like a bad idea to me. I thought that Halide would reject them anyway at Func definition time - it views vectorization as a property of a loop, and thus target-dependent and implicit, not as a property of the data type.
How about moving your vector types into a separate WIP PR? They are not supported anywhere downstream (promotion, reductions) and will also fail on any operation other than addition given your library code.
cc @Ravenwater
Sounds good, I'll put them in a WIP PR @abadams; they are currently used in … Re. types, it is one of the goals of TC to make it easy to implement and experiment with various types (quantized, posits, vector, sparse, etc.). Is it envisionable to have the pieces of Halide we currently use view ADTs as first-class citizens? If not, at least let us propagate more exotic types to the TC mapper.
In Halide, mathematical vectors, complex numbers, etc. are represented as Tuples. The vector types are for SIMD, which is just a way of executing a loop, not a data type. E.g. SIMD on the next version of ARM doesn't necessarily have a statically known width, so that's not a useful data type. If we want vector front-end types for the purpose of representing vectors, e.g. in R^3, we should lower them to Halide Tuples.
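For illustration, here is a minimal sketch of that lowering idea, assuming the standard Halide C++ API (the function and variable names are purely illustrative, not anything in this PR):

```cpp
#include "Halide.h"
using namespace Halide;

// Minimal sketch: a field of R^3 vectors expressed as a Tuple-valued Func.
// Each Tuple element holds one component; SIMD vectorization stays a
// scheduling decision (e.g. v.vectorize(i, 8)), not part of the data type.
Func vec3_field(Func a, Func b, Func c) {
  Var i("i");
  Func v("v");
  v(i) = Tuple(a(i), b(i), c(i));  // candidate lowering target for a front-end vector type
  return v;
}
```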
typedef unsigned short uint16;
typedef unsigned int uint32;
typedef unsigned long uint64;
// typedef half float16;
This essentially means that compilation will fail with a message from nvrtc that is obscure to the user when attempting to use float16. I'd rather ditch it, or add a TC compilation error saying "float16 not supported because nvrtc does not want it".
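A minimal sketch of what such a guard could look like; the function name and call site are hypothetical, not the actual TC API:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical guard: reject float16 up front with a clear TC-level error
// instead of letting nvrtc fail later with an obscure message.
void checkScalarTypeSupported(const std::string& typeName) {
  if (typeName == "float16") {
    throw std::runtime_error(
        "float16 is not supported: nvrtc cannot include the system headers "
        "that define the half type");
  }
}
```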
_(TK_INT16, "int16", "int16") \
_(TK_INT32, "int32", "int32") \
_(TK_INT64, "int64", "int64") \
_(TK_FLOAT16, "float16", "float16") \
Do we want to have both `float32` and `float`? Do we want to differentiate?
Given the semantics of the other types, it is consistent.
Removing `float` will break existing code.
I don't see a big issue in the duplication here.
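Purely as an illustration of keeping both spellings (the macro name below is hypothetical and the rows only mirror the ones quoted above, not necessarily the exact entries in the PR):

```cpp
// Both "float" and "float32" get their own token but describe the same
// 32-bit floating-point type, so code spelled with "float" keeps compiling.
#define TC_FORALL_FLOAT_TOKEN_KINDS(_) \
  _(TK_FLOAT, "float", "float")        \
  _(TK_FLOAT32, "float32", "float32")
```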
tc/lang/sema.h
Outdated
  case tok:    \
    code_ = c; \
    bits_ = b; \
#define TYPE_INFO_OPTION(tok, c, b, l) \
What is `l` used for?
This commit adds missing type support to the lexer, semantic analyzer, and CUDA implementation. It also adds language and end-to-end functional tests for all types we support. An annoying type issue comes from the inability to include system dependencies with NVRTC, so half support is explicitly disabled. Maybe it is time to think about moving to NVCC in a separate process, which we have been discussing for some time. The following commit adds an example from an issue submitted by @mdouze, which now runs properly thanks to this commit.
This commit adds examples provided by @mdouze where the argmin over a reduced sum is required. These examples are now functional thanks to the previous commit, but extra work is needed to make some of the variants perform reasonably:
1. for the fused kernel to parallelize properly across blocks we need grid synchronization; this may be a nice concrete use case, @math-fehr;
2. for the 1-stage fissioned implementation we need device-wide synchronization, otherwise we will always be limited by running on a single SM;
3. the 2-stage fissioned implementations can give us performance today after tuning.

Without tuning, the results on the larger size (1e7, 32, 16) are shown [here](https://gist.github.com/nicolasvasilache/8a0addfb6831a831b2dca45c612f9c2d). `mindis_16_32_10000000` is the totally fused kernel and performs very poorly. The following 5 kernels correspond to the final use case of interest.
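For readers unfamiliar with the fissioned scheme, here is a plain C++ sketch of the 2-stage argmin-over-a-reduced-sum pattern (illustrative only; the actual kernels are written in TC and mapped to CUDA, and all names below are made up):

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Illustrative 2-stage argmin over a reduced sum (squared L2 distance).
// Stage 1: each chunk of candidates produces a partial (distance, index) pair,
// mimicking per-block partial reductions; stage 2 reduces those partials.
// Assumes points is non-empty and every row has the same length as query.
std::size_t argmin_distance(const std::vector<std::vector<float>>& points,
                            const std::vector<float>& query,
                            std::size_t chunk = 1024) {
  std::vector<std::pair<float, std::size_t>> partials;
  for (std::size_t begin = 0; begin < points.size(); begin += chunk) {
    std::size_t end = std::min(begin + chunk, points.size());
    float best = std::numeric_limits<float>::max();
    std::size_t best_idx = begin;
    for (std::size_t i = begin; i < end; ++i) {
      float d = 0.f;  // reduced sum over the feature dimension
      for (std::size_t k = 0; k < query.size(); ++k) {
        float diff = points[i][k] - query[k];
        d += diff * diff;
      }
      if (d < best) { best = d; best_idx = i; }
    }
    partials.emplace_back(best, best_idx);  // stage-1 output
  }
  // Stage 2: reduce the per-chunk partial argmins to a single result.
  float best = partials.front().first;
  std::size_t result = partials.front().second;
  for (const auto& p : partials) {
    if (p.first < best) { best = p.first; result = p.second; }
  }
  return result;
}
```

The fully fused variant corresponds to doing both stages in a single loop nest, which is why it needs grid-level synchronization to parallelize across blocks, whereas the 2-stage split only needs a kernel boundary between the stages.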
Addresses #511; with naive options the perf of the 5-kernel variant seems reasonable for a start.
A lot more work is needed to get the single-kernel and the 1-stage implementations to good performance, so the 2-stage version will probably be a nice low-hanging fruit.