Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: container/heap: add Heap, a heap backed by a slice #47632

Open
cespare opened this issue Aug 11, 2021 · 30 comments
Open

proposal: container/heap: add Heap, a heap backed by a slice #47632

cespare opened this issue Aug 11, 2021 · 30 comments
Labels
generics Issue is related to generics Proposal
Milestone

Comments

@cespare
Copy link
Contributor

cespare commented Aug 11, 2021

This proposal is for use with #43651. I propose to add a parameterized Heap type to the container/heap package. For the typical usage of container/heap where the heap is backed by a slice, using heap.Heap is safer, more convenient, and more readable:

  • No conversions to or from any are required.
  • Instead of defining 5 methods (Less, Len, Swap, Push, and Pop), most of which are typically boilerplate, only the cmp function is needed.
  • Having the heap functions Push, Pop, etc. be methods is more natural than functions: pq.Push(e) vs. heap.Push(pq, e).

Update 2024-07-22: Switched from a less function to a cmp function.

Update 2021-08-28: Removed the Lesser interface; pass in a less function to the constructor.

// A Heap is a min-heap backed by a slice.
type Heap[E any] struct { ... }

// New constructs a new Heap with a comparison function.
func New[E any](less func(E, E) bool) *Heap[E]

// Push pushes an element onto the heap. The complexity is O(log n)
// where n = h.Len().
func (h *Heap[E]) Push(elem E)

// Pop removes and returns the minimum element (according to the less function)
// from the heap. Pop panics if the heap is empty.
// The complexity is O(log n) where n = h.Len().
func (h *Heap[E]) Pop() E

// Peek returns the minimum element (according to the less function) in the heap.
// Peek panics if the heap is empty.
// The complexity is O(1).
func (h *Heap[E]) Peek() E

// Len returns the number of elements in the heap.
func (h *Heap[E]) Len() int

// Slice returns the underlying slice.
// The slice is in heap order; the minimum value is at index 0.
// The heap retains the returned slice, so altering the slice may break
// the invariants and invalidate the heap.
func (h *Heap[E]) Slice() []E

// SetIndex specifies an optional function to be called
// when updating the position of any heap element within the slice,
// including during the element's initial Push.
//
// SetIndex must be called at most once, before any calls to Push.
//
// When an element is removed from the heap by Pop or Remove,
// the index function is called with the invalid index -1
// to signify that the element is no longer within the slice.
func (h *Heap[E]) SetIndex(f func(E, int))

// Fix re-establishes the heap ordering
// after the element at index i has changed its value.
// Changing the value of the element at index i and then calling Fix
// is equivalent to, but less expensive than,
// calling h.Remove(i) followed by a Push of the new value.
// The complexity is O(log n) where n = h.Len().
// The index for use with Fix is recorded using the function passed to SetIndex.
func (h *Heap[E]) Fix(i int)

// Remove removes and returns the element at index i from the heap.
// The complexity is O(log n) where n = h.Len().
// The index for use with Remove is recorded using the function passed to SetIndex.
func (h *Heap[E]) Remove(i int) E

Notes

I evaluated the uses of container/heap in our company's internal code as well as my own projects and found that all of them used a heap backed by a slice. This indicates that a slice-based implementation could save a good deal of boilerplate in much the same way that sort.Slice saves boilerplate over sort.Sort.

A simpler API would do away with the "slice"-ness, leaving the implementation unspecified and removing the index-based methods:

type Heap[E any] struct { ... }

func NewHeap[E any](less func(E, E) bool) *Heap[E]
func (h *Heap[E]) Push(elem E)
func (h *Heap[E]) Pop() E
func (h *Heap[E]) Peek() E
func (h *Heap[E]) Len() int

However, my review of container/heap usage showed that this simplified Heap would cover fewer than half of those real-life use cases. Many real uses of container/heap need the index-based methods Fix or Remove; for instance, to delete items from a priority queue.

A different simplification would be to use constraints.Ordered elements rather than the Lesser constraint. While this makes for concise example code, in practice, such heaps seem relatively rare; all the uses I analyzed had slice elements of some compound type.


Here is a playground implementation of the proposal showing the priority queue example implemented using Heap.

@gopherbot gopherbot added this to the Proposal milestone Aug 11, 2021
@ianlancetaylor ianlancetaylor added the generics Issue is related to generics label Aug 11, 2021
@rsc
Copy link
Contributor

rsc commented Aug 11, 2021

I'm worried about doing too much in the standard library in Go 1.18.
We don't really know what best practices for containers are going to be.
I'm wary of committing to a container/set API as well as to committing to this API.

@ianlancetaylor
Copy link
Contributor

type Lesser[E any] interface {
	Less(E) bool
}

type Heap[E Lesser[E]] struct { ... }

I don't think this is going to be the right path for generic data types. I think it is better to initialize the heap with a comparison function. My argument is that if you have a comparison function, it's tedious to create a type with a Less method that invokes that comparison function. But if you have a Less method, it's trivial to use it as a comparison function, by simply passing elementType.Less.

I admit that this means that the zero value can't be used directly. Still, I think that overall it is the more flexible approach.

@ianlancetaylor
Copy link
Contributor

The docs should mention what happens for Pop and Peek if the heap is empty.

I'm not quite seeing what SetIndex is for. At first I thought it could be used to maintain a secondary data structure, but in that case I think it needs the index when an element is removed.

@cespare
Copy link
Contributor Author

cespare commented Aug 11, 2021

The docs should mention what happens for Pop and Peek if the heap is empty.

Done.

@cespare
Copy link
Contributor Author

cespare commented Aug 11, 2021

@ianlancetaylor

I'm not quite seeing what SetIndex is for. At first I thought it could be used to maintain a secondary data structure,

The playground example I linked is an adapted version of the PriorityQueue example from the container/heap package which makes use of SetIndex.

In general, I would expect it to be used to associate each element with its index somehow, whether that's as part of the element (as in my example) or using some secondary data structure such as a map[E]int.

but in that case I think it needs the index when an element is removed.

I don't follow this. SetIndex is called when an element's index is changed; the index provided to the callback is the new index (so -1 when the element is removed); it sounds like you're saying we would need the old index (instead? in addition?). I don't see why, though -- for example, if your external data structure is a map[E]int, then when the callback is called with f(e, -1), the code would delete(m, e).

@ianlancetaylor
Copy link
Contributor

If the secondary data structure is another slice, then, you're right, we need both the old and the new index.

@rsc
Copy link
Contributor

rsc commented Aug 25, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@cespare
Copy link
Contributor Author

cespare commented Aug 26, 2021

@ianlancetaylor yes, an auxiliary parallel slice that mirrors all the swaps of the Heap slice is a situation in which it would be convenient to have the old index as well.

I'm not sure why you'd ever want to maintain a parallel slice, though. If you have some extra data to associate with each heap element, why not add it to the element itself?

@cespare
Copy link
Contributor Author

cespare commented Aug 26, 2021

I don't think this is going to be the right path for generic data types. I think it is better to initialize the heap with a comparison function.

OK, after some thought, I agree that this will make many use cases simpler. I updated the proposal to use a less comparison function passed to a constructor (New).

@kylelemons
Copy link
Contributor

I'm also in favor of less being a function passed into the constructor. In my usages, I often use heap with a local type (inside the function) which cannot have a method. The reverse is possible: a type with a less method can pass the method value to the constructor.

@rsc
Copy link
Contributor

rsc commented Sep 1, 2021

It sounds like we should put this on hold for after Go 1.18. Any objections to that?

@rogpeppe
Copy link
Contributor

rogpeppe commented Sep 2, 2021

// A Heap is a min-heap backed by a slice.
type Heap[E any] struct { ... }

another possibility:

type Heap[E any] struct {
    Slice []E
    Less func(E, E) bool
}

@Merovius
Copy link
Contributor

Merovius commented Sep 7, 2021

I'm generally in favor of exporting Less from the struct, as it allows to avoid a pointer-indirection if the heap is used as a struct field.

Furthermore, one advantage of the original Lesser[E] variant would then also be that the extra func field doesn't have to be dragged around, making the type smaller. I do agree with @ianlancetaylor that the func version is more helpful in practice, but I think this drawback should be mentioned. It doesn't matter much as long as Heap can only be used as a pointer, though.

@rsc
Copy link
Contributor

rsc commented Sep 8, 2021

Placed on hold.
— rsc for the proposal review group

@cespare
Copy link
Contributor Author

cespare commented Dec 1, 2021

Over at #47619 (comment) @anacrolix asked about a concrete implementation of a generic slice-based heap. In my original proposal here, I linked to a very quick'n'dirty one in the playground (that simply wraps container/heap's current interface), but it would be interesting to do a proper implementation and compare some benchmarks to see how the type-parameterized version compares to the heap.Interface one. I might get around to that at some point but if someone else wants to write it up, feel free.

@adonovan
Copy link
Member

adonovan commented Dec 1, 2021

I linked to a very quick'n'dirty one in the playground (that simply wraps container/heap's current interface), but it would be interesting to do a proper implementation and compare some benchmarks to see how the type-parameterized version compares to the heap.Interface one.

If you do that, please measure the code segment size too. I'm curious to know how much expansion to expect from multiple instantiations of a generic type.

In C++ it was occasionally worth defining a widely used template as a mere wrapper around an untyped generic implementation, to reduce the quantity of code duplicated for each new template instantiation. The GNU linker was smart enough to coalesce templates that were essentially identical (e.g. vector of X* vs Y*). I don't know what the Go toolchain's notion of equivalence is, and presumably it varies across implementations.

@SaveTheRbtz
Copy link

SaveTheRbtz commented Apr 10, 2023

It may be a bit awkward to have Pop and Peek to panic if heap is empty, since now caller needs to always keep track of the number of elements in the heap. Maybe, use the ok bool instead:

func (h *Heap[E]) Pop() (E, bool)
func (h *Heap[E]) Peek() (E, bool)

Google's generic BTree implementation uses this approach.

As for the any vs Ordered, I think there can be separate constructors, one wrapping another. Again, BTree has both New and NewOrdered. Not too pretty, but may save some boilerplate for tests and trivial types.

PS. @rsc, since 1.18 is long out, maybe this issue can be unheld?

@Merovius
Copy link
Contributor

@SaveTheRbtz I don't think you have to keep track. You only have to call h.Len to find out. Personally, I don't understand the decision of that API. After all, we don't require a two-variable assignment when accessing a slice, but trust programmers to check if len(s) > 0 first. I would personally prefer if h.Pop() was an expression I could put into a function call - almost every case where I used a heap in the past has been in a for h.Len() > 0 loop anyways.

@cespare
Copy link
Contributor Author

cespare commented Apr 10, 2023

It may be a bit awkward to have Pop and Peek to panic if heap is empty, since now caller needs to always keep track of the number of elements in the heap.

Well, not exactly "keep track"; they just need to call Len.

With the proposed signature, typical Heap-using code looks like this:

for h.Len() > 0 {
	e := h.Pop()
	// ...
}

With the boolean result, the loop looks like:

for {
	e, ok := h.Pop()
	if !ok {
		break
	}
	// ...
}

This particular case doesn't seem like much of an improvement.

I'm sympathetic to the notion that (E, bool) is a bit safer since it better encourages the caller to be aware of the empty case. OTOH the existing heap API doesn't provide the bool. On balance I prefer without the bool.

@flowchartsman
Copy link

flowchartsman commented Nov 15, 2023

If you do that, please measure the code segment size too. I'm curious to know how much expansion to expect from multiple instantiations of a generic type.

@adonovan :

I had a need to play around with this, so I did an implementation and wrote some code generation to compare generic types which you can see here: https://go.dev/play/p/aQWbwOAnfXI

attached is resultant generated source code and the output of go tool nm -size -sort size -type for each binary. I was somewhat pleasantly surprised, though it's possible I'm doing something wrong.

cmdgen_main.go.txt
heapgeneric_nm_complete.txt
heapgeneric_nm_gheap_types.txt
heapstd_nm_complete.txt
heapstd_nm_heap_types.txt
main_generic.go.txt
main_std.go.txt
gheap.go.txt

@earthboundkid
Copy link
Contributor

A) This proposal can probably come off of hold now.

B) I would like to see this implemented as container/heap/v2 with this api:

package heap

func New[E any](cmp func(E, E) int) *Heap[E]  // Instead of a less function
func (h *Heap[E]) All() iter.Seq2[int, E] // No Slice() method

// Other methods as proposed above.

@anacrolix
Copy link
Contributor

I have been using this form: https://github.com/anacrolix/generics/blob/main/heap/slice.go#L46-L51. It tries to avoid the heap allocation of Heap in func New[E any](cmp func(E, E) int) *Heap[E] above. It also lets the caller maintain ownership over the slice. In my use case I have a slice that moves around parts of the program, and at one point needs to be heapified, which I do by wrapping it in Interface[T] and passing that to the heap functions. I still maintain ownership afterwards and return to reuse it elsewhere in the program. Also note the rest of github.com/anacrolix/generics/heap is just a direct translation of container/heap to generics. I use this in a very performance critical part of anacrolix/torrent.

@earthboundkid
Copy link
Contributor

That's interesting. One idea would be to add heap.Slice[E any] to heap v1 and keep using the old interface functions, and also add a new heap v2 package that hides the slice and does its own memory management.

@tamirms
Copy link

tamirms commented Apr 28, 2024

I suggest changing the proposed implementation of the Pop() function on sliceHeap:

func (s *sliceHeap[E]) Pop() interface{} {
	var zero E
	e := s.s[len(s.s)-1]
	if s.setIndex != nil {
		s.setIndex(e, -1)
	}
	// avoid memory leak by clearing out popped value in slice
	s.s[len(s.s)-1] = zero
	s.s = s.s[:len(s.s)-1]
	return e
}

s.s[len(s.s)-1] = zero will avoid retaining a reference to the popped value in the slice

@xibz
Copy link

xibz commented Jul 14, 2024

I'm worried about doing too much in the standard library in Go 1.18.
We don't really know what best practices for containers are going to be.
I'm wary of committing to a container/set API as well as to committing to this API.

But we already have a heap implementation that doesnt use generics... So we are stuck with that implementation until some "best" practice API is found. Specifically for something like containers that may be very opinionated.

I think this poses a lot of hesitation which is not necessarily a bad thing, and it is important to note any feature added to Go should be considered forever. However with the introduction of the slice package it seems hypocritical to not do it for also containers at this point.

Given that we are much past Go 1.18, we should really relook at this. And we could always add this to the experimental package until we get it right. My hope was that this was not implemented due to the amount of features and content within v1.18 and nothing more. I suggest removing the "proposal hold"

@rossb83
Copy link

rossb83 commented Jul 22, 2024

A) This proposal can probably come off of hold now.

B) I would like to see this implemented as container/heap/v2 with this api:

package heap

func New[E any](cmp func(E, E) int) *Heap[E]  // Instead of a less function
func (h *Heap[E]) All() iter.Seq2[int, E] // No Slice() method

// Other methods as proposed above.

What are thoughts to give a fixed priority for each element at its input into the heap instead of a global less function?

type Heap[E any, P cmp.Ordered] struct { ... }

func (h *Heap[E, P]) Push(elem E, priority P)

The less function is certainly more convenient, but it also may hide undesirable behavior. Take for instance a heap of heap of heap of int. A less function may be written to require a sequence of recursive calls as elements move up and down the heap and the sub-heaps to repeatedly find the min. Aside from expense, the less function may hide bugs; for example a developer makes the less function compute seconds remaining until a certain date (or something of that nature).

A heap relies on comparisons being both fixed and cheap to be useful in practice and requiring an exact priority per element ensures there can be no surprises.

@Merovius
Copy link
Contributor

I believe in every case I've needed a heap so far, the Push(E, P) API would have been more fitting and I had to emulate it myself, for what it's worth.

@cespare
Copy link
Contributor Author

cespare commented Jul 22, 2024

@earthboundkid regarding this suggestion:

package heap

func New[E any](cmp func(E, E) int) *Heap[E]  // Instead of a less function
func (h *Heap[E]) All() iter.Seq2[int, E] // No Slice() method

// Other methods as proposed above.
  • I agree with switching from less to cmp since that fits better with the newer generic sorting-related APIs. I made that change.
  • I disagree with making the heap an iterator. The purpose of Slice (which is definitely an "advanced" method that not every caller would need) is to let the caller actually get at the underlying slice data. An iterator doesn't satisfy this need, and it also encourages misuse of the heap as a sorting mechanism (where a bunch of elements are inserted and then the whole thing is iterated until empty). It would be better for the caller to just append all their elements to a slice and then use slices.SortFunc.

@cespare
Copy link
Contributor Author

cespare commented Jul 22, 2024

@rossb83 / @Merovius: I disagree with replacing the comparison function with a priority cmp.Ordered.

In the use cases I've studied, it's very common that there is no single priority number and the comparison function compares multiple attributes over a sequence of tie-breaking steps. Here's a simple example:

func (h *workQueueHeap) Less(i, j int) bool {
	c0 := (*h)[i]
	c1 := (*h)[j]
	if c0.active != c1.active {
		return c0.active < c1.active
	}
	return c0.start.Before(c1.start)
}

This one just has two comparisons, but I have other examples that have 4 or more.

Turning that kind of thing into a single comparison integer (for example) is annoying to write and error-prone (remember to consider overflow). By contrast, if you already have a simple priority integer, the cmp function is a one-line cmp.Compare call.

The less function is certainly more convenient, but it also may hide undesirable behavior. Take for instance a heap of heap of heap of int. A less function may be written to require a sequence of recursive calls as elements move up and down the heap and the sub-heaps to repeatedly find the min.

I don't really follow this objection. I've reviewed a bunch of uses of container/heap and I don't recall anything like this being an issue.

@nicholas-eden
Copy link

Can we come back to this? What do we need to get this into the standard library? A priority queue is a pretty common tool to have in standard libs. I believe the proposed interface looks like it would fit 90% of use cases and save devs a lot of work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
generics Issue is related to generics Proposal
Projects
Status: Incoming
Development

No branches or pull requests