Conversation

@phate (Owner) commented Apr 26, 2025

We currently have two DNE transformations in the code base: one for the LLVM dialect and one for the HLS dialect. We duplicated them in the first place because the HLS dialect needs to handle an additional structural node (the loop node) that is not handled by the LLVM DNE.

This PR introduces a way to extend DNE for custom structural nodes so that the two transformations can be unified and the duplicated code can be removed. This PR is meant as a POC and only demonstrates the idea for the gamma node. The other nodes could be refactored in a similar fashion. However, I would like to get some feedback to ensure that this is on the right track before I invest the time to convert the rest of the transformation.
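To make the extension point concrete, a handler along roughly these lines could be registered per structural node kind. This is only a sketch with made-up names (`DNENodeHandler`, `DNEContext`); the actual interface in this PR may look different:

```cpp
// Rough sketch of a per-node-kind DNE extension point. The names here
// (DNENodeHandler, DNEContext) are made up for illustration and need not
// match the interface proposed in this PR.
namespace jlm::rvsdg
{
class Node;
}

class DNEContext; // tracks which outputs have been marked alive

class DNENodeHandler
{
public:
  virtual ~DNENodeHandler() = default;

  // Returns true if this handler is responsible for the given structural node.
  virtual bool Matches(const jlm::rvsdg::Node & node) const = 0;

  // Mark phase: given the node's live outputs, mark the origins inside the
  // node's subregions (and the node's inputs) that must stay alive.
  virtual void Mark(const jlm::rvsdg::Node & node, DNEContext & context) const = 0;

  // Sweep phase: remove the node's dead inputs, outputs, and subregion items.
  virtual void Sweep(jlm::rvsdg::Node & node, DNEContext & context) const = 0;
};
```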

@phate requested review from caleridas, haved, and sjalander on April 26, 2025, 19:42
@phate (Owner, Author) commented Apr 26, 2025

@haved @caleridas @sjalander Could you provide me with some feedback so that I know whether this is going in the right direction, please?

@caleridas (Collaborator) commented:

> @haved @caleridas @sjalander Could you provide me with some feedback so that I know whether this is going in the right direction, please?

That might be a necessity, but TBH the separate loop type is really terrible to begin with. There is almost no point to it, as everything hls_loop needs, Theta already does -- hls_loop is a "constrained subset" of normal Theta capabilities. I would really like it if users of hls_loop could look into formulating it this way.

@sjalander (Collaborator) commented:

@caleridas I don't fully understand hls_loop being a subset of theta.

The loop explicitly models the backedge with its own result and argument, so each loop variable has 1 input, 2 arguments, 2 results, and 1 output. This is so that in hardware one can select between the input and the backedge with a multiplexer, and between the backedge and the output.
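For illustration, the port shape of a single hls_loop variable might be sketched like this (made-up names, not the actual jlm-hls types):

```cpp
// Made-up sketch of the port shape described above (1 input, 2 arguments,
// 2 results, 1 output per loop variable); not the actual jlm-hls types.
struct Port
{
  // stands in for a wire/edge in the dataflow graph
};

struct HlsLoopVar
{
  Port input;            // value entering from outside the loop
  Port inputArgument;    // argument fed by the input
  Port backedgeArgument; // argument fed by the backedge
                         // (a multiplexer selects between these two arguments)
  Port backedgeResult;   // result feeding the backedge
  Port outputResult;     // result feeding the output
                         // (a multiplexer selects between backedge and output)
  Port output;           // value leaving the loop
};
```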

Is this possible to model with a version of the existing theta?

@caleridas (Collaborator) commented Apr 27, 2025

> @caleridas I don't fully understand hls_loop being a subset of theta.

> The loop explicitly models the backedge with its own result and argument, so each loop variable has 1 input, 2 arguments, 2 results, and 1 output. This is so that in hardware one can select between the input and the backedge with a multiplexer, and between the backedge and the output.

> Is this possible to model with a version of the existing theta?

Yes, in the following way: for each "logical" loop variable X, create two theta loop variables, I and B:

  • feed the input into I, feed "undefined" or "zero" into B before loop entry
  • at beginning of loop iteration, demultiplex between I and B as you do now
  • at end of loop iteration feed the output into B, feed I into itself (this is "passthrough")
  • after loop exit, take the output from B, ignore/discard I

So I is effectively a "loop-invariant constant". I had already entertained the thought of whether we want to explicitly model loop-invariant variables separately from loop-carried inputs, but maybe that's a story for another time.
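A minimal, runnable sketch of this I/B encoding in plain C++ (simulating the dataflow only; this deliberately does not use the jlm API):

```cpp
#include <cstdio>
#include <optional>

// Simulates one "logical" loop variable X split into two theta variables:
//   I - carries the loop input unchanged (passthrough each iteration)
//   B - carries the back-edge value; "undefined" before loop entry
int main()
{
  int input = 5;        // value fed into the theta node
  int I = input;        // feed the input into I
  std::optional<int> B; // feed "undefined" into B before loop entry

  bool firstIteration = true;
  while (true)
  {
    // at the beginning of the iteration, demultiplex between I and B
    int X = firstIteration ? I : *B;
    firstIteration = false;

    int result = X + 1; // stand-in loop body producing the next value

    // at the end of the iteration: feed the output into B; I feeds itself
    B = result;

    if (result >= 10) // loop predicate
      break;
  }

  // after loop exit, take the value from B and discard I
  std::printf("final value: %d\n", *B);
  return 0;
}
```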

The primary "issue" is that the structure of loop input / loop-carried / loop output variables conforming to the above must either be preserved by all transformation steps, or there must be some facility to restore this structure. Neither is conceptually hard, but I admit that it is not terribly joyful work to iron out all the corner cases. One conceptually good way to do it would be to treat such "special form" requirements as additional data on the ThetaNode; hooking this up correctly would in particular allow detecting any constraint violation instantly at each transformation step. On the plus side, going this route probably allows immediately leveraging all loop transformations across backends.
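As a sketch of what "special form requirements as additional data on the ThetaNode" could look like (made-up names; jlm may want a different shape):

```cpp
// Made-up sketch: a "special form" constraint attached to a theta so that
// every transformation step can validate conformance immediately.
namespace jlm::rvsdg
{
class ThetaNode;
}

class ThetaFormConstraint
{
public:
  virtual ~ThetaFormConstraint() = default;

  // Returns false if the theta no longer conforms to the required form,
  // e.g. if an "invariant" variable I is no longer fed into itself.
  virtual bool Check(const jlm::rvsdg::ThetaNode & theta) const = 0;
};

// A transformation would then assert after each rewrite, roughly:
//   if (auto * constraint = GetFormConstraint(theta))
//     assert(constraint->Check(theta));
```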

I am not opposed to parameterizing DNE such that it can be consolidated today already, as the above is non-negligible work. This is not mutually exclusive with reworking the loop node longer-term, but I think it is worth keeping in mind as an option going forward.

One additional comment: a logical evolution path for Theta is also to generically distinguish between "loop-carried back-edge" and "output" variables. That would, I think, also address another part of what hls_loop does differently right now. In particular, it would allow an entirely distinct set of "back edge" vs. "output" values. The reason this would be useful outside hls is that it captures the exact signature of tail-recursive function calls.
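A concrete (if contrived) example of that signature: in the tail-recursive function below, the back-edge set {n, acc} differs from the output set {acc}:

```cpp
// Tail-recursive accumulator: each "iteration" carries n and acc along the
// back edge, but only acc leaves as the output. A theta that distinguishes
// back-edge variables from output variables captures exactly this shape.
int Factorial(int n, int acc = 1)
{
  if (n <= 1)
    return acc;                     // loop exit: the output is acc only
  return Factorial(n - 1, n * acc); // back edge: carries both n and acc
}
```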

@haved (Collaborator) commented Apr 28, 2025

@caleridas The suggestions you bring up are things I have considered as well, but I have never quite come up with a nice way of representing them.

For example, it is typical for a Theta to have a bunch of results that are always Undef, except when the predicate is false and the loop is exiting. These become region arguments that should not be used. Conversely, there are results that are always Undef when the loop is actually exiting, which leads to Theta outputs that should never be used. I am not sure whether this is actually a problem when writing compiler passes, but as a human looking at the graph it is extra work to remember which theta results are actually "only" exit variables and which are "only" loop variables.
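A plain C++ analogue of that pattern (not jlm code): `exitCode` below is "Undef" on every iteration that continues, while `i` is meaningless once the loop has exited:

```cpp
#include <cstdio>
#include <optional>

// Plain C++ analogue of "exit-only" vs. "loop-only" theta values:
// exitCode is unset ("Undef") while the loop keeps iterating, and i is
// only meaningful while the loop is running.
int main()
{
  int i = 0;                   // "loop-only" variable
  std::optional<int> exitCode; // "exit-only" variable

  while (true)
  {
    if (i >= 3)
    {
      exitCode = i * 10; // defined exactly when the predicate says "exit"
      break;
    }
    ++i; // live only while iterating
  }

  std::printf("exit code: %d\n", *exitCode);
  return 0;
}
```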

I do not know how one would go about changing that, however, without making large changes. If one were to introduce separate context variables, loop variables, and exit variables, it would break the symmetry between indices; I am not sure that is worth it.

One final question I have regards the special hls_loop. It handles memory port routing very differently from traditional loop variables. The arguments can produce multiple memory responses each iteration, and the results can consume multiple memory requests per iteration. Neither port is synced up with the actual looping of the theta. Is this also representable using a regular Theta node? Outside the realm of hls, are there other situations where we would like to represent such "sideways" flow of data in and out of regions?

@caleridas (Collaborator) commented:

> @caleridas The suggestions you bring up are things I have considered as well, but I have never quite come up with a nice way of representing them.

> For example, it is typical for a Theta to have a bunch of results that are always Undef, except when the predicate is false and the loop is exiting. These become region arguments that should not be used. Conversely, there are results that are always Undef when the loop is actually exiting, which leads to Theta outputs that should never be used. I am not sure whether this is actually a problem when writing compiler passes, but as a human looking at the graph it is extra work to remember which theta results are actually "only" exit variables and which are "only" loop variables.

I think it would be useful to discuss this in detail because it is more than just an aesthetic issue.

> I do not know how one would go about changing that, however, without making large changes. If one were to introduce separate context variables, loop variables, and exit variables, it would break the symmetry between indices; I am not sure that is worth it.

One important point in introducing abstractions such as "LoopVar" or "ContextVar" is to also reformulate the API in these terms and stop making assumptions about indices. For operational nodes, indices identify the "meaning" of an operand (e.g. divisor vs. dividend), but for structural nodes that is a bit less of a thing.
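A sketch of what such an index-free API could look like (made-up names; not necessarily what jlm's ThetaNode offers):

```cpp
// Made-up sketch of an index-free theta API: passes hold LoopVar handles
// instead of computing operand/result indices themselves.
namespace jlm::rvsdg
{
class input;
class output;
}

struct LoopVar
{
  jlm::rvsdg::input * input;   // value entering the theta
  jlm::rvsdg::output * pre;    // region argument at the start of an iteration
  jlm::rvsdg::input * post;    // region result at the end of an iteration
  jlm::rvsdg::output * output; // value leaving the theta
};

// A pass would then write, roughly:
//   LoopVar loopVar = theta->AddLoopVar(origin);
//   Use(loopVar.pre);  // instead of theta->subregion()->argument(i)
//   Set(loopVar.post); // instead of theta->subregion()->result(i)
```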

> One final question I have regards the special hls_loop. It handles memory port routing very differently from traditional loop variables. The arguments can produce multiple memory responses each iteration, and the results can consume multiple memory requests per iteration. Neither port is synced up with the actual looping of the theta. Is this also representable using a regular Theta node? Outside the realm of hls, are there other situations where we would like to represent such "sideways" flow of data in and out of regions?

I think I do not understand what you mean by "memory port routing different from traditional loop variables" -- my naive understanding would be that memory ports are states with ops capable of delivering different values each iteration. This is fine, and basically mirrors the behavior of "load". (That, however, implies another interesting question that I do not know the answer to -- does it mean that hls_loops also have an implicit relation to the clock? That would be a bigger issue than any state questions.)

I think that in any case there is value in the generic theta being able to represent what you need, and if it needs extending to do so, I think that is still a good idea.

That said, the approach taken in this PR to just move forward is perfectly reasonable.

@phate (Owner, Author) commented Apr 28, 2025

@haved @caleridas @sjalander This spiraled a little bit out of control. The discussion is less about the current approach and more about whether it should be taken in the first place due to the HLS loop. This is definitely worth a discussion, but the HLS loop won't go anywhere any time soon. Thus, unless any one of you still has objections to the approach, I would just move ahead and produce a proper PR. @caleridas has already sanctioned the approach. Any objections from your side, @haved and @sjalander?

@sjalander (Collaborator) commented May 4, 2025

@phate This optimization lives in llvm, but is intended to be made generic such that it can be applied to both llvm and hls. Would it make sense for the generic part to live in jlm/opt or jlm/rvsdg/opt instead?

The following code example from the PR:

jlm::llvm::DeadNodeElimination dne({ llvm::DNEGammaNodeHandler::GetInstance() });

would be kept as it currently is in master:

jlm::llvm::DeadNodeElimination dne;

Instead, there would be a

class jlm::rvsdg::DeadNodeElimination

and one would define an instance of it for each backend:

class jlm::llvm::DeadNodeElimination : public jlm::rvsdg::DeadNodeElimination
{
public:
  DeadNodeElimination()
    : jlm::rvsdg::DeadNodeElimination({ jlm::llvm::DNEGammaNodeHandler::GetInstance() })
  {}
};
