
[WIP][DO NOT MERGE] Experimental vector types #513

Open · wants to merge 3 commits into master

Conversation

nicolasvasilache
Contributor

This is a WIP experiment; please do not review.

I am looking for feedback on how best to propagate vector types through Halide, following up on the discussion in #511 and #512.

The first 2 commits in the stack are irrelevant.

In a first experiment, I'm interested in using a type annotation in TC to express that a particular type is a vector with proper alignment (i.e. one that can be loaded exactly into an x86 vector register and whose operators I can define with intrinsics). In that experiment, using a TC vector type in the language serves to black-box it in the TC mapper and guarantee that low-level SIMD code is generated.
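To make "proper alignment plus intrinsics" concrete, here is a minimal sketch of the kind of black-boxed type I have in mind (the names are illustrative and not part of this PR; assumes AVX is available):

```cpp
#include <immintrin.h>

// Illustrative sketch: an 8-wide float vector whose alignment guarantees
// that loads and stores map onto a single AVX register move.
struct alignas(32) vfloat8 {
  __m256 v;

  // Load/store exactly one x86 vector register from 32-byte-aligned memory.
  static vfloat8 load(const float* p) { return {_mm256_load_ps(p)}; }
  void store(float* p) const { _mm256_store_ps(p, v); }

  // Operators are defined via intrinsics, never via a scalar fallback.
  friend vfloat8 operator+(vfloat8 a, vfloat8 b) {
    return {_mm256_add_ps(a.v, b.v)};
  }
  friend vfloat8 operator*(vfloat8 a, vfloat8 b) {
    return {_mm256_mul_ps(a.v, b.v)};
  }
};
```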

The fact that Halide makes vectorization a property of the loop is a design decision orthogonal to my experiment, and not something I plan to inherit in TC before experimenting. Of course, I'd like to convey the information through Halide in a proper way if possible.
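For reference, this is what "vectorization as a property of the loop" looks like in Halide's public API (standard Halide, sketch only):

```cpp
#include "Halide.h"
using namespace Halide;

// In Halide, vectorization is a scheduling directive on a loop:
// f keeps a scalar float type, and only the loop over x is vectorized.
Func f("f");
Var x("x");
f(x) = cast<float>(x) * 2.0f;
f.vectorize(x, 8);
```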

So what would you recommend in this case?
Using Halide's vector type seemed quite natural, and it seems to get the job done (i.e. test_compile_and_run.cc actually produces code that runs without crashing).

In light of your comments on #512, do I understand properly that lanes are only meant to be used internally within Halide?
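Concretely, what I am doing amounts to the following use of Halide's type API (a sketch of my reading of it; please correct me if this is off):

```cpp
#include "Halide.h"

// A 32-bit float type with 4 lanes, i.e. a float32x4 vector type.
Halide::Type vec_ty = Halide::Float(32, 4);
// vec_ty.bits() == 32, vec_ty.lanes() == 4, vec_ty.bytes() == 16.

// Vector-typed Exprs are built with IR nodes from Halide::Internal,
// which is part of why I ask whether lanes are meant to stay internal.
Halide::Expr v = Halide::Internal::Broadcast::make(1.0f, 4);
```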

This commit adds missing type support to the lexer, semantic analyzer,
and CUDA implementation.
It also adds language and end-to-end functional tests for all the types we support.

An annoying type issue comes from the inability to include system dependencies
with NVRTC, so half support is explicitly disabled.
Maybe it is time to think about moving to NVCC in a separate process, as
we have been discussing for some time.
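The shape of the limitation, for anyone who hasn't hit it (hedged sketch, error handling omitted): nvrtcCreateProgram only sees source you hand it explicitly, so includes that reach for host system dependencies, which half support does here, cannot be resolved.

```cpp
#include <nvrtc.h>

// NVRTC compiles in isolation from the host toolchain: headers must be
// handed over in memory (or found via --include-path), so anything that
// needs host system headers is unavailable. This is why half support is
// disabled for now.
const char* src = R"(
extern "C" __global__ void copy(float* out, const float* in) {
  out[threadIdx.x] = in[threadIdx.x];
})";

void compile() {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, src, "copy.cu",
                     /*numHeaders=*/0, /*headers=*/nullptr,
                     /*includeNames=*/nullptr);
  const char* opts[] = {"--gpu-architecture=compute_60"};
  nvrtcCompileProgram(prog, 1, opts);  // error handling omitted
}
```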

The following commit adds the example from an issue submitted by @mdouze,
which now runs properly thanks to this commit.
This commit adds examples provided by @mdouze where the argmin over
a reduced sum is required.
These examples are now functional thanks to the previous commit, but extra
work is needed to make some of the variants perform reasonably:
1. For the fused kernel to parallelize properly across blocks, we need
   grid synchronization; this may be a nice concrete use case for @math-fehr
   (see the sketch after this list).
2. For the 1-stage fissioned implementation, we need device-wide synchronization;
   otherwise we will always be limited by running on a single SM.
3. The 2-stage fissioned implementations can give us performance today,
   after tuning.
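On point 1, here is a sketch of what grid-wide synchronization looks like with CUDA cooperative groups (CUDA 9+; illustrative only, this is not what TC emits today):

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void fused_argmin_kernel(/* ... */) {
  cg::grid_group grid = cg::this_grid();

  // Phase 1: each block reduces its slice of the distances.
  // ...

  // Grid-wide barrier: all blocks must finish phase 1 before any block
  // starts combining partial results. This is the synchronization the
  // fused kernel is missing today.
  grid.sync();

  // Phase 2: reduce the per-block partial argmins.
  // ...
}
// Host side: the kernel must be launched with cudaLaunchCooperativeKernel,
// and the grid size is bounded by what the device can co-schedule.
```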

Without tuning, the results on the larger size (1e7, 32, 16)
are shown [here](https://gist.github.com/nicolasvasilache/8a0addfb6831a831b2dca45c612f9c2d).
`mindis_16_32_10000000` is the fully fused kernel and performs very poorly.
The following 5 kernels correspond to the final use case of interest.
This commit adds experimental vector type support to the lexer,
semantic analyzer, and CUDA implementation.
It also adds language and end-to-end functional tests for all types.

This is limited by the fact that ATen doesn't allow such types.
Therefore, this commit adds some striding genuflexions to work around ATen
issues.
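The flavor of genuflexion involved, as a hypothetical sketch (the helper and the ATen spellings are mine, not the exact code in this commit): since ATen has no float4 dtype, carry an (N, 4) float tensor and reinterpret it at the kernel boundary.

```cpp
#include <ATen/ATen.h>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical illustration: the innermost dimension of size 4 plays the
// role of the vector lanes; the storage is recovered as float4* on the
// way into the kernel.
float4* as_float4(const at::Tensor& t) {
  AT_ASSERT(t.is_contiguous());
  AT_ASSERT(t.scalar_type() == at::kFloat);
  AT_ASSERT(t.size(-1) == 4);
  float* p = t.data_ptr<float>();
  // float4 loads require 16-byte alignment; fresh ATen allocations are
  // sufficiently aligned, but sliced/strided views may not be.
  AT_ASSERT(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
  return reinterpret_cast<float4*>(p);
}
```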
@facebook-github-bot

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours has expired.

Before we can review or merge your code, we need you to email cla@fb.com with your details so we can update your status.
