Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: use channels as iterators #48567

Closed
deanveloper opened this issue Sep 23, 2021 · 11 comments
Closed

proposal: use channels as iterators #48567

deanveloper opened this issue Sep 23, 2021 · 11 comments

Comments

@deanveloper
Copy link

Establishing a standard for iteration

Related: #43557 and #47707

Problem

With the proposals for the slices and maps packages, as well as other proposals talking about generic data structures (ie container/list and container/set), it has become clear that we need some pattern to iterate over structures.

Common iteration patterns

There are two common iterator patterns from what I have seen.

The first is a pretty standard pattern, aptly named the "Iterator Pattern". This is a pattern such that you return an iterator which is repeatedly called over time, and the iterator will return the next value each time the function is called. This is the iteration pattern that #43557 has recommended for Go. One of the great benefits of this pattern is that it is standard among many languages. However, these iterators tend to be quite cumbersome because the writer of the iterator needs to manage state over time, which can be extremely difficult, especially when dealing with complex data structures like maps, or recursive data structures like trees.

The second pattern is a bit more modern and is typically named the "Generator Pattern". In this pattern, the iterator is a single function which is passed a "yield" function (or in many languages, a yield keyword which may be only be used in Generator Functions). Each time yield is called, control goes to the caller of the iterator, and then control is given back to the iterator again, until yield is called again, or until the iterator ends. The amazing thing about the generator pattern, is that the iterator is extremely easy to write, and it ends up looking almost like a solution where one would simply append to a slice (and return the slice in the end). This is the pattern which #47707 recommends.

Using channels as iterators

In Go, we actually already have a form of iterating using the generator pattern: goroutines and channels. This can be done by making a function which returns a read-only channel. This function creates a goroutine, and sends values on the channel. It would look something like this, which is valid Go code today:

func IntRangeIter(from, to int) <-chan int {
	ch := make(chan int)
	go func() {
		defer close(ch)

		for i := from; i < to; i++ {
			ch <- i
		}
	}()
	return ch
}

func main() {
	for i := range IntRangeIter(0, 5) {
		fmt.Printf("%d, ", i)
	}
	fmt.Println()
}
// Output: 0, 1, 2, 3, 4, 

There are a few problems with this pattern in current Go. The main issue is that we must exhaust the iterator channel in order for the spawned goroutine to be destroyed. Otherwise, the goroutine in the iterator will simply hang on ch <- i forever.

The second issue is performance. Channels in Go unfortunately have a lot of overhead. In my own testing, recursive iterators took ~500x longer to use channels to iterate over a binary tree (compared to calling a function on each element). Iterating over a slice instead of a tree, it took ~100x longer to use channels.

Optimizations can definitely be made though, as using Javascript's generator pattern (which also uses coroutines in some form) can iterate over the binary tree in ~10x longer than calling a function on each element in Go.

Proposal

This proposal has three parts:

  1. Establish a standard of using channels for iterators in the standard library.
  2. Reconsider proposal: runtime: garbage collect goroutines blocked forever #19702, and garbage-collect goroutines which are blocked forever.
  3. Add optimizations for goroutines/channels so that using them as iterators does not cost so much time.

For part 1, this would mean we take actions such as adding Iter() <-chan T for container/list and container/set.

For part 2, we reconsider #19702 (such that cleanup is not done - the goroutine vanishes when it is GC'd). This allows us to spawn these goroutines, and the caller of the iterator does not need to communicate to the iterator that we are done iterating.

Part 3 is likely the least important of the three, but it is still extremely important. Currently, this pattern is two orders of magnitude slower than calling a function on each element. Languages (Kotlin, Rust) are already adopting this pattern and use a coroutine implementation, but do not see the same immense performance hits that Go does.

The wonderful thing about this update, is that nothing about the language itself needs to change. Tooling does not need to be updated, as channels are already range-able.

Example

func FibonacciIter() <-chan int {
	ch := make(chan int)

	go func() {
		defer close(ch)

		a, b := 0, 1
		for {
			ch <- a

			c := a + b
			a, b = b, c
		}
	}()

	return ch
}

// Usage
func main() {
	//          0
	//       1     4
	//      2 3   5
	t := TreeOf[int](0, 1, 2, 3, 4, 5)
	for i := range t.InOrderIter() {
		fmt.Printf("%d, ", i)
	}
	fmt.Println()
}
// Output: 2, 1, 3, 0, 5, 4, 

Other solutions

Many solutions were discussed in #43557, and I highly recommend checking them out. Here are a couple other solutions that also involve solutions to allow using channels as iterators:

  • Adding runtime.Deadlocked() to tell if the current goroutine is deadlocked.

This is my second favorite solution behind allowing goroutines to be GC'd.

Alternatively, this could be unexported, and imported (via //go:linkname) in something like chans.Generator. This would give the generator function a yield function, which would select between runtime.deadlocked() and sending on the iterator channel.

chans.Generator would look something like this:

//go:linkname deadlocked runtime.deadlocked
func deadlocked() <-chan struct

// Creates an iterator based off of a generator function.
func Generator(generator func(yield func(int))) <-chan int {
	ch := make(chan int)

	go func() {
		defer close(ch)
		generator(func(incoming int) {
			select {
			case ch <- incoming:
			case deadlocked():
				runtime.Goexit()
			}
		})
	}()
	return ch
}

// usage
func FibonacciIter() <-chan int {
	return chans.Generator(func (yield func(int)) {
		a, b := 0, 1
		for {
			yield(a)

			c := a + b
			a = b
			b = c
		}
	})
}
  • Using finalizers to allow us to close the channel.

Finalizers are a bit hacky, but it somewhat solves our problem. There is a cool example that @ianlancetaylor shows in the generics draft about Rangers. Unfortunately, these are not actually rangeable, but it does show that we could possibly use a finalizer to clean up a goroutine.

@gopherbot gopherbot added this to the Proposal milestone Sep 23, 2021
@ianlancetaylor
Copy link
Contributor

Add optimizations for goroutines/channels so that using them as iterators does not cost so much time.

I think that is essential, but it isn't simple. Do you have any thoughts on how we can make it much faster?

@deanveloper
Copy link
Author

I'm not entirely sure, as I'm not very well versed in compiler optimizations (especially for synchronization). All I really know is that it is possible (unsure how feasible), because other compiled languages such as Kotlin and Rust have coroutine mechanisms which are able to implement the generator pattern in an efficient manner

(see Kotlin Sequences and Rust Generators)

In #43557 (comment), @carlmjohnson had noticed that the iteration example runs significantly faster with GOMAXPROCS=1, and suggested trying to detect when goroutines would only ever yield to each other, and using a simpler scheduler in those cases.

@ianlancetaylor
Copy link
Contributor

Explicit coroutine mechanisms are significantly different than Go goroutines communicating over channels. I agree that if the compiler can reliably recognize the channel-as-iterator pattern when compiling the goroutine that sends the values on the channel, then we can compile that case differently and get significantly better performance. But that is a really big if. If we mistakenly think that an ordinary goroutine is a channel-generation goroutine, the program will deadlock. If we mistakenly think that a channel-generation goroutine is an ordinary goroutine, then performance will be much worse than expected.

I don't think we can just wave our hands about this issue.

One way to describe the difference between what Kotlin and Rust and Go are doing is that in Kotlin and Rust, we always know for sure that the values being generated are being consumed by the equivalent of Go's for/range loop. In Go we do not know that.

@deanveloper
Copy link
Author

deanveloper commented Sep 24, 2021

I think one of the reasons that I wanted to do this proposal is just because we have multiple iterator patterns in the standard library (reflect.MapIter vs sync.Map.Range). There have also been at least two proposals for user-definable iterators, when we really already have them as channels. However, they are slow and possibly cause goroutine leaks.

One way that we could help be more clear that a channel-generation goroutine is in fact a channel-generation goroutine could be to use a function like chans.Generator[T any](generator func(yield func(T))) <-chan T, which would function similar to the Generator function listed in the proposal. Maybe we only guarantee the optimization if this function is used (or if we want to be more conservative, only apply the optimization if this function is used).

@bcmills
Copy link
Contributor

bcmills commented Sep 24, 2021

if the compiler can reliably recognize the channel-as-iterator pattern when compiling the goroutine that sends the values on the channel, then we can compile that case differently and get significantly better performance.

I thought for a moment about somehow fusing the goroutine to the channel at creation time, but then I realized that “a channel fused to a goroutine” is structurally and semantically different from an ordinary channel, and “a goroutine fused to a channel” is structurally and semantically different from an ordinary goroutine.

So I think what we're describing is, really, a new type — perhaps a stream[T] — that supports a range statement and a send (and perhaps select) operator with syntax similar to channel syntax.

@bcmills
Copy link
Contributor

bcmills commented Sep 24, 2021

@deanveloper, one thing missing from this description is the interaction between defer and collection of the goroutine.

If the iterator is not drained, do its deferred function calls ever execute? If so, when? If not, I think the fact that they do not run would be surprising.

Similarly, would the “goroutine” begin running before the caller tries to receive the first value, or only on demand?

If it does run before the caller tries to receive the first value, we have the same off-by-one problem as with the yield-callback approach (#43557 (comment)). On the other hand, if the goroutine does not run until the first value-receive, that could lead to surprising and subtle deadlocks, since Go users today can (and do!) assume that their goroutines eventually start running.

@deanveloper
Copy link
Author

So I think what we're describing is, really, a new type — perhaps a stream[T] — that supports a range statement and a send (and perhaps select) operator with syntax similar to channel syntax.

I think part of the issue with a separate stream type, would be how similar they are to channels semantically. They would support the same operations and (essentially) function the same as a channel.

If the iterator is not drained, do its deferred function calls ever execute? If so, when? If not, I think the fact that they do not run would be surprising.

I had this typed out in the proposal, but I think I may have accidentally deleted it in a revision of it. Deferred functions should not execute, as garbage-collecting a forever-blocked goroutine should behave as if the goroutine is still blocked. This does have some implications, for instance, iterators would not be able to close file io operations, and the caller would have to open/close the file themselves.

Note that this could be solved with the runtime.deadlocked() idea, as the generator function would be able to properly return and cleanup.

Similarly, would the “goroutine” begin running before the caller tries to receive the first value, or only on demand?

My initial thought was that it would begin running before the caller tries to receive the first value. This would however result in the off-by-one issue. For most iterators this is not a big deal. But if the iterator has side effects, or if the data structure that the iterator is based on is modified, this may result in some strange behavior. Iterators (in general) shouldn't be having side effects though, as those should be saved for when the iterator is consumed. But I guess we can't guarantee that.

@deanveloper
Copy link
Author

I was thinking of an alternative proposal for this, where we introduce a new builtin, I'm not really sure what to name it (for now I guess we can call it iterator), but it's essentially a struct { a chan struct{}, b chan T }, and would have semantics like this:

(iter is a variable of type iterator T. For illustration purposes, iter.a and iter.b refer to the chans shown above, but would not actually be valid Go code)

  1. <-iter is equivalent to iter.a <- struct{}{}; <-iter.b
  2. iter <- v is equivalent to iter.b <- v; <-iter.a
  3. close(iter) is equivalent to close(iter.a); close(iter.b)
  4. A new builtin, iterwait, is equivalent to <-iter.a
  5. for v := range iter { ... } is equivalent to the following code:
iter.a<-struct{}{}
for v := range iter.b {
    ...
    iter.a<-struct{}{}
}

Just as channels, iterators are support the read-only (<-iterator T) and write-only (iterator<- T) type modifiers.

An iterator would then look like:

func IntRangeIter(from, to int) <-iterator int {
	iter := make(iterator int)
	go func() {
		defer close(iter)

		for i := from; i < to; i++ {
			iter <- i
		}
	}()
	return iter
}

func main() {
	for i := range IntRangeIter(0, 5) {
		fmt.Printf("%d, ", i)
	}
	fmt.Println()
}
// Output: 0, 1, 2, 3, 4, 

This would be a complex (and unorthogonal) solution to solve a relatively minor problem, though. I don't really like it, and it defeats the main purpose of this proposal, which would be to introduce iterators without language changes.

@bcmills
Copy link
Contributor

bcmills commented Oct 5, 2021

I think part of the issue with a separate stream type, would be how similar they are to channels semantically. They would support the same operations and (essentially) function the same as a channel.

Note that slices, arrays, and maps also all support the same operations. The ways that a stream differs from a channel seem at least as important to me as the ways in which an array differs from a slice.

@rsc
Copy link
Contributor

rsc commented Oct 6, 2021

I don't think we are going to reconsider #19702 (comment). See that comment for details, which have not changed.

If there are other optimizations that can be applied, that's always fine. But we're not going to GC blocked goroutines, and therefore we're not going to use channels as iterators in the standard library.

Declining as infeasible.

@rsc
Copy link
Contributor

rsc commented Oct 6, 2021

This proposal has been declined as infeasible.
— rsc for the proposal review group

@rsc rsc closed this as completed Oct 6, 2021
@golang golang locked and limited conversation to collaborators Oct 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants