Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Time/Diff GATs #502

Merged
merged 6 commits into from
May 28, 2024

Conversation

frankmcsherry
Copy link
Member

@frankmcsherry frankmcsherry commented May 27, 2024

The PR slowly introduces GATs for Time and Diff. It largely reproduces work of @antiguru, but in an incremental fashion that helps me at least understand what is going on. The incremental steps are:

  1. The first commit generalizes the Layout containers to provide separate storage for times and diffs, which allows those containers to speak clearly about associated types for those containers. The GATs are immediately converted to owned data, so the transient allocations are still high (increased, even), but the storage opportunities are increased by allowing the containers to store their data in a way that doesn't need to present as &Time and &Diff.
  2. The second and third commits introduce DiffGat<'_> and TimeGat<'_> respectively, which are exactly the GATs offered by the underlying containers, but surfaced upwards through the APIs provided by cursors. Concretely this means that map_times() communicates GATs rather than references, and users of the function can avoid allocation if they do not require it. These commits did not try especially hard to prevent immediately calling into_owned() and re-invoking transient allocations.
  3. The fourth commit tries harder to eliminate calls to into_owned(), which are the moments we convert from GATs to owned types, and where we might perform needless allocation. The commit generalizes Semigroup to act on references, and e.g. the constraint that Tr::Diff : for<'a> Semigroup<Tr::DiffGat<'a>> to ensure that we can accumulate the GAT types into the owned types directly, without needing to first take ownership. Similar changes for times, though mostly around streamlining the idiom of time.join(meet) for a GAT time (take ownership, then join_assign).

There is work to continue to progressively nibble away at the reliance on the owned types, though unlike keys and values they are unlike to go away: both time and diff need to be mutated (advancing to frontiers and consolidation, respectively). This seems likely to force some owned representation, as one cannot simply rely on the existing read-only data in the backing storage.

@frankmcsherry frankmcsherry force-pushed the time_diff_gats branch 2 times, most recently from 23416e5 to 92c8ca8 Compare May 27, 2024 23:53
dogsdogsdogs/src/operators/lookup_map.rs Outdated Show resolved Hide resolved
src/operators/count.rs Outdated Show resolved Hide resolved
@frankmcsherry
Copy link
Member Author

The current state of the PR, to my understanding is:

  1. The first commit allows containers with GATs to store Time and Diff, but does not otherwise change very much. Folks still get to see &Time and &Diff, because step one is to call .into_owned() on the GATs. This introduces some new allocations, and it feels especially bad in cases where we were not previously going to allocate (e.g. using a &Time) and we should look at each of these. But importantly, this first step opens up more efficient storage at the cost of ephemeral allocations (improving RAM by spending CPU) and may already be worth landing.
  2. The second and third commits move the information about those GATs up the stack, through fn map_times returning GATs rather than references, which means the storage layer doesn't need to call .into_owned() any more. In several cases, we still just call .into_owned(), but we can start to incrementally nibble at uses of GATs that are force into owned versions and replace them with uses of GATs instead. I started this and got tripped up in some places, and so went with a minimal version that unless it is quite sure, just calls into_owned() and moves on.

I think the main remaining question is how much .into_owned() minimization must we do before landing the PR, or one like it. The into_owned() calls are a regression for someone storing String data, say, but are almost certainly a win for anyone who wants a FlatContainer or Columnation back-end. In the limit, it would be great to have our internals have Cow<'a, Diff> everywhere, to only allocate when we need to accumulate, and generally to minimize ephemeral allocations as well. But for next steps, we should probably go over the most egregious .into_owned() calls, especially when ownership was not previously required, and taking stock of them.

@frankmcsherry frankmcsherry changed the title WIP: Attempt to introduce Time/Diff GATs Introduce Time/Diff GATs May 28, 2024
@frankmcsherry frankmcsherry marked this pull request as ready for review May 28, 2024 19:14
Copy link
Member

@antiguru antiguru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

src/trace/implementations/ord_neu.rs Outdated Show resolved Hide resolved
@frankmcsherry frankmcsherry merged commit 08fce58 into TimelyDataflow:master May 28, 2024
7 checks passed
This was referenced Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants