Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

join, half_join: add lower-level interfaces #327

Merged
merged 6 commits into from
Aug 9, 2021

Conversation

uce
Copy link
Contributor

@uce uce commented May 27, 2021

@frankmcsherry I'm unsure whether the params of the "unsafe" closure make sense this way for the usages you have in mind.

Some alternatives could be:

  1. Provide time1 and lub(time1, time2) separately
  2. Provide diff1 and diff2 separately

I'm happy to move things around after some input.

If the general direction looks good, I can follow-up with the other operators.

@frankmcsherry
Copy link
Member

The structure here looks good. I had a few thoughts about ways it could be tidied, but morally things like good right now.

  1. I thought you could probably skip the Tr::R: std::ops::Mul<Tr::R, Output=Tr::R> constraint, as the multiplication happens in the closure now. I now think that this isn't the case, because let diff = count * diff.clone();. However, your (2.) is roughly what might make sense (it is what lookup_map does). There, it is helpful because some instances of the operators want to apply distinct to the arranged state, and that is easily done just ignoring the second count.
  2. You could probably go even farther, and just allow I: IntoIterator<Item=DOut> where it is up to the user to pick the type of iterated output. This would allow them to e.g. change the R type if they want, or generally do whatever they need in the output (maybe: pair the output up differently). The downside here is that we wouldn't be able to have the operator compact the output post-output_func.

Your (1.) seems interesting, but maybe great to hold off on for now. At least, there is value in joining the times to allow the compaction (otherwise we probably expect a multiplicative increase in the number of outputs, as each input record hits ~idk lots of records that could be collapsed). Perhaps in some more general future we hand the cursor instance over to the closure.

@frankmcsherry
Copy link
Member

Related: I think there may be a latent bug (not yours, but I'd love to get your thoughts) where two records in proposals have the same key. It naively looks like we'll walk the cursor forward, and then the next record (with that same key) will "seek" from past the key already (and .. seek doesn't work that way).

I'm not sure for how long this has been here, but if the above is correct I've wrangled with it before in other implementations. It is a bit gross here because each record has a different initial, which means the accumulation logic is different. It could probably be "fixed" with a rewind_vals call at the end of each logic (so one is always left positioned at the beginning of the key one started with, which should allow seek_key to not overshoot.

@frankmcsherry
Copy link
Member

Bah, nevermind. Apparently old Frank is smarter than new Frank, and pre-wrote this in there:

                                }
                                cursor.rewind_vals(&storage);

@uce uce marked this pull request as ready for review June 2, 2021 13:24
@uce
Copy link
Contributor Author

uce commented Jun 8, 2021

@frankmcsherry Sorry for the long silence here!

  1. It makes sense to lift the Tr::R: std::ops::Mul<Tr::R, Output=Tr::R> constraint. I've pushed a fixup in 16847ac to provide diff1 and diff2 separately. The only downside is that this introduces another clone() for the closure of the existing half_join.

  2. I do like the idea of making things more general as outlined in your 2nd point, but I have a few questions:

    1. Are you referring to calls to consolidate when you say compaction or are you referring to something else? Note that we actually don't call consolidate after the output func, yet, but we could!
    2. I'm unsure how I: IntoIterator<Item=DOut> would work. I can see that we would want to return Collection<G, DOut, ROut> instead of Collection<G, DOut, Tr::R>, but that would still imply Item=(DOut, ROut), right? Is that what you meant? I'm probably missing something here. In any case, I've pushed a small fixup to replace Tr::R with ROut in 0a26469.

@uce
Copy link
Contributor Author

uce commented Jun 8, 2021

  1. Regarding the extra clone() for multiply: Using differential_dataflow::difference::Multiply gets rid of extra clone(), because it works against &Rhs instead of Rhs. Any downside in using that constraint instead of std::ops::Mul?

@uce uce changed the title half_join: add unsafe variant join, half_join: add lower-level interfaces Jun 8, 2021
@uce
Copy link
Contributor Author

uce commented Jun 8, 2021

@frankmcsherry I added another commit for the lower-level join interface. Maybe I should have moved it to a separate PR, let me know if you would prefer that way.

Your answer to point 2 above might be applicable to join as well.

@frankmcsherry
Copy link
Member

Sorry for the latency. Things look good at the moment, but I wanted to get to your questions.

I'm unsure how I: IntoIterator<Item=DOut> would work.

I was imagining that the method could return a Stream<G, DOut> which has a zero cost conversion to Collection in the case that DOut has the structure of (D, T, R) for some triples. This would let the user produce whatever stream they want, commonly a collection but idk if it needed to be a different stream it wouldn't need a map operator to change its shape. By "compact" I did indeed mean "consolidate". Oops.

Using differential_dataflow::difference::Multiply ...

This is probably a great idea! We shifted because several diff types don't implement Mul (like tuples) and Multiply is what the rest of DD uses.

No rush to land things, but I think all of these are in the right direction, and even if things land we might end up iterating a bit as we use them, so don't worry much about the nits.

@frankmcsherry frankmcsherry merged commit 65d1a5a into TimelyDataflow:master Aug 9, 2021
This was referenced Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants