Add README for Iterator<A> explaining design and usage
Introduce a comprehensive README detailing the purpose, design, and usage patterns of `Iterator<A>`. This document explains how it transforms the limitations of `IEnumerator<A>` into a functional, thread-safe, and immutable structure. Examples and recommendations are provided to clarify best practices and potential pitfalls.
louthy committed Jan 16, 2025
1 parent 941f0fd commit e2d024e
Showing 1 changed file with 256 additions and 0 deletions.
256 changes: 256 additions & 0 deletions LanguageExt.Core/Immutable Collections/IteratorAsync/README.md
@@ -0,0 +1,256 @@
`Iterator<A>` is a functional wrapper for `IEnumerator<A>`. The abstraction leaks a little, so it's worth
understanding how it works by reading the details below. On the whole it behaves like an immutable stream
that caches values as it goes, but there are some footguns that you should be aware of so that they can be
avoided.

## Problem: `IEnumerator<A>`

* It is mutable, which makes it problematic to use in immutable data structures.
* If you pass an `IEnumerator` reference to two threads, each thread can
call `MoveNext` and it will move the enumeration position for the other thread or, even worse, move past the end
of the sequence due to race conditions.
* Enumerators start before the first item and use a complicated mechanism for accessing and testing the validity of
the element value.
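
The shared-position problem can be seen without any threads at all. The following is a minimal sketch using only the base class library; `Numbers` is a hypothetical helper introduced for illustration and is not part of this library:

```c#
using System;
using System.Collections.Generic;

// Hypothetical stand-in sequence, written as an iterator local function.
static IEnumerable<int> Numbers()
{
    yield return 1;
    yield return 2;
    yield return 3;
}

// Two variables, but only one underlying enumerator: both share its position.
IEnumerator<int> a = Numbers().GetEnumerator();
IEnumerator<int> b = a;

a.MoveNext();                 // the enumerator starts *before* the first item
Console.WriteLine(b.Current); // prints 1: advancing `a` also advanced `b`

b.MoveNext();
Console.WriteLine(a.Current); // prints 2: and vice versa

a.Dispose();
```

With two threads in place of `a` and `b`, the same aliasing becomes a race: neither thread can know which element it will see next.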

_Nobody in their right mind would invent an interface like `IEnumerator<A>` today._

## Solution: `Iterator<A>`

`Iterator<A>` still uses `IEnumerator<A>` internally, but it makes it thread-safe and functional. From the outside
the type acts and works exactly like any other immutable sequence, but internally it does some quite complex
processing to achieve this with an `IEnumerator<A>` reference.

> You may say "Why not just drop `IEnumerator<A>`?" - which is a completely valid position to hold. Unfortunately,
> `IEnumerable` and `IEnumerator` are baked into the CPS state-machine that is used for `yield return` and
> `yield break`. So, we don't get to ignore those types, and instead we need to make them play nice.

`IEnumerable<A>` has a method called `GetEnumerator()` which is used to access an `IEnumerator<A>`. A new extension
method called `GetIterator()` is available, which yields an `Iterator<A>`.

You can pattern-match an `Iterator<A>` like a functional 'cons' linked-list type:

```c#
static A Sum<A>(Iterator<A> iter) where A : INumber<A> =>
    iter switch
    {
        Iterator<A>.Nil                 => A.Zero,
        Iterator<A>.Cons(var x, var xs) => x + Sum(xs),
    };
```
Or, use `IsEmpty` and `Head`:
```c#
static A Sum<A>(Iterator<A> iter) where A : INumber<A> =>
    iter.IsEmpty
        ? A.Zero
        : iter.Head + Sum(iter.Tail);
```
Or, use built-in operators:
```c#
static A Sum<A>(Iterator<A> iter) where A : INumber<A> =>
    iter.Fold(A.Zero, (s, x) => s + x);
```
Or, take an imperative approach:
```c#
static A Sum<A>(Iterator<A> iter) where A : INumber<A>
{
    var total = A.Zero;
    while(!iter.IsEmpty)
    {
        total += iter.Head;
        iter = iter.Tail;
    }
    return total;
}
```

`Iterator<A>` is `abstract`, and the first type returned from `GetIterator()` will be an `Iterator<A>.ConsFirst`, which
derives from `Iterator<A>`. The internal fields that `ConsFirst` contains are these:
```c#
IEnumerable<A> enumerable;
int firstAcquired;
Iterator<A>? firstValue;
```
So, you can see that it doesn't actually have an `IEnumerator<A>` at this point.

The two key properties of `Iterator<A>` are `Head` (for accessing the current item) and `Tail` (for accessing
the remaining items), so let's look at those for `ConsFirst`:
```c#
public override A Head =>
    First.Head;

public override Iterator<A> Tail =>
    First.Tail;
```
They both access `First`, which is:

```c#
Iterator<A> First
{
    get
    {
        if (firstAcquired == 2) return firstValue!;

        SpinWait sw = default;
        while (firstAcquired < 2)
        {
            if (Interlocked.CompareExchange(ref firstAcquired, 1, 0) == 0)
            {
                try
                {
                    var enumerator = enumerable.GetEnumerator();
                    if (enumerator.MoveNext())
                    {
                        firstValue = new ConsValueEnum(enumerator.Current, enumerator);
                    }
                    else
                    {
                        enumerator.Dispose();
                        firstValue = Nil.Default;
                    }

                    firstAcquired = 2;
                }
                catch (Exception)
                {
                    firstAcquired = 0;
                    throw;
                }
            }
            else
            {
                sw.SpinOnce();
            }
        }

        return firstValue!;
    }
}
```
This all looks quite complex, but you should be able to see that the `Interlocked.CompareExchange`
then-block is where the `IEnumerator` is created. We then either set `firstValue` to a new `ConsValueEnum` with
the head-item and the `enumerator` as arguments; or we set it to `Nil`.

Upon success we set `firstAcquired` to `2`, so subsequent calls to `First` just return `firstValue`. If an exception
is thrown, `firstAcquired` is reset to `0` so that a later access can try again. This spin-wait and compare-exchange
technique protects the enumerator from race conditions efficiently without taking a `lock`.

So, upon first access to either `Head` or `Tail` we launch the `IEnumerator` and cache the first item in the sequence.
All subsequent access goes to `Head` or `Tail` on either `Nil` or `ConsValueEnum`. We never touch the `IEnumerator`
again in `ConsFirst`.
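
To see the run-once guard in isolation, here is a minimal sketch of the same three-state pattern using only the base class library. The names `state`, `cached`, `initRuns` and `GetValue` are hypothetical, and the `Volatile` reads and writes are an addition for the sketch rather than something taken from `ConsFirst`:

```c#
using System;
using System.Threading;
using System.Threading.Tasks;

int state = 0;      // 0 = not started, 1 = initialisation in progress, 2 = done
int cached = 0;     // the value published once `state` reaches 2
int initRuns = 0;   // counts how often the guarded block actually executes

int GetValue()
{
    if (Volatile.Read(ref state) == 2) return cached;   // fast path

    SpinWait sw = default;
    while (Volatile.Read(ref state) < 2)
    {
        // Only the thread that flips 0 -> 1 may run the initialiser.
        if (Interlocked.CompareExchange(ref state, 1, 0) == 0)
        {
            try
            {
                Interlocked.Increment(ref initRuns);
                cached = 42;                    // stand-in for the expensive work
                Volatile.Write(ref state, 2);   // publish the result
            }
            catch
            {
                Volatile.Write(ref state, 0);   // allow a later retry
                throw;
            }
        }
        else
        {
            sw.SpinOnce();                      // another thread is initialising
        }
    }
    return cached;
}

var results = new int[16];
Parallel.For(0, results.Length, i => results[i] = GetValue());

Console.WriteLine(initRuns);                                 // prints 1
Console.WriteLine(Array.TrueForAll(results, r => r == 42));  // prints True
```

However many threads race into `GetValue`, exactly one wins the compare-exchange and the rest spin until the value is published.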

The `Nil` implementation isn't so surprising:
```c#
public override A Head =>
    throw new InvalidOperationException("Nil iterator has no head");

public override Iterator<A> Tail =>
    this;
```
`ConsValueEnum` is where it gets interesting. It has the following internal fields:
```c#
Exception? exception;
IEnumerator<A>? enumerator;
int tailAcquired;
Iterator<A>? tailValue;
```
It also has a `Head` property that is set in the constructor:
```c#
public override A Head { get; }
```
So, we can access the `Head` value at any time, but the `Tail` value isn't yet set:
```c#
public override Iterator<A> Tail
{
    get
    {
        if(tailAcquired == 2) return tailValue!;
        if(tailAcquired == 3) exception!.Rethrow();

        SpinWait sw = default;
        while (tailAcquired < 2)
        {
            if (Interlocked.CompareExchange(ref tailAcquired, 1, 0) == 0)
            {
                try
                {
                    if (enumerator!.MoveNext())
                    {
                        tailValue = new ConsValueEnum(enumerator.Current, enumerator);
                    }
                    else
                    {
                        enumerator?.Dispose();
                        enumerator = null;
                        tailValue = Nil.Default;
                    }

                    tailAcquired = 2;
                }
                catch (Exception e)
                {
                    exception = e;
                    tailAcquired = 3;
                    throw;
                }
            }
            else
            {
                sw.SpinOnce();
            }
        }

        if(tailAcquired == 3) exception!.Rethrow();
        return tailValue!;
    }
}
```
This does a similar thing to `ConsFirst` of protecting a section with `Interlocked.CompareExchange`. So, we can
only ever access the 'then' part [of that `if` statement] once. In that block we `MoveNext` the `IEnumerator`
which will either return `true` or `false`.

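If `true` then we create another `ConsValueEnum`; if `false` then we use `Nil`. Whichever is created gets assigned
to `tailValue` and `tailAcquired` gets set to `2`. That means subsequent calls to `Tail` will just return
`tailValue`. If `MoveNext` throws, the exception is stored and `tailAcquired` is set to `3`, so every later call to
`Tail` rethrows that same exception rather than touching the `IEnumerator` again.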

That process continues for each item of the sequence until the `IEnumerator` runs out of items to yield. The end
result is a linked-list of `ConsValueEnum` objects that have a `ConsFirst` object at the head of the linked-list.
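
The shape of that linked list can be illustrated with a deliberately simplified sketch. `Cell<A>` below is hypothetical and is not the library's implementation: it uses `Lazy<T>` in place of the hand-written compare-exchange loop, uses `null` in place of `Nil`, and reads the first item eagerly rather than deferring it as `ConsFirst` does. It only aims to show that each cell advances the shared `IEnumerator` at most once and then remembers the result:

```c#
using System;
using System.Collections.Generic;

int pulls = 0;   // how many items have been pulled from the source

IEnumerable<int> Source()
{
    for (var i = 1; i <= 3; i++)
    {
        pulls++;
        yield return i;
    }
}

int Sum(Cell<int>? cell)
{
    var total = 0;
    for (; cell is not null; cell = cell.Tail) total += cell.Head;
    return total;
}

var head = Cell<int>.From(Source());
var first = Sum(head);
var second = Sum(head);   // walks the cached cells; the source is not re-run

Console.WriteLine($"{first} {second} {pulls}");   // prints "6 6 3"

// Hypothetical, simplified cell: `null` plays the role of Nil.
sealed class Cell<A>
{
    readonly Lazy<Cell<A>?> tail;

    Cell(A head, IEnumerator<A> enumerator)
    {
        Head = head;
        // The enumerator is advanced at most once per cell, on first Tail access.
        tail = new Lazy<Cell<A>?>(() => Next(enumerator));
    }

    public A Head { get; }
    public Cell<A>? Tail => tail.Value;

    public static Cell<A>? From(IEnumerable<A> items) =>
        Next(items.GetEnumerator());

    static Cell<A>? Next(IEnumerator<A> enumerator)
    {
        if (enumerator.MoveNext()) return new Cell<A>(enumerator.Current, enumerator);
        enumerator.Dispose();
        return null;
    }
}
```

Because `head` is kept alive across both calls to `Sum`, all three cells stay reachable.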

So, `Iterator<A>` effectively caches the sequence as you go. If you hold on to the head of the sequence then the
whole list may end up in memory at once. This could be problematic when working with large lazy sequences or even
infinite sequences.

This, for example, is fine:
```c#
for(var iter = Naturals.GetIterator(); !iter.IsEmpty; iter = iter.Tail)
{
Console.WriteLine(iter.Head);
}
```
This works because the `iter` reference keeps being overwritten, so nothing holds on to the head item of
the sequence and the garbage collector can collect the items that are no longer referenced.

Whereas this will cause memory-usage to grow and grow:
```c#
var start = Naturals.GetIterator();
for(var iter = start; !iter.IsEmpty; iter = iter.Tail)
{
Console.WriteLine(iter.Head);
}
```
Here `start` holds a reference to the first item, so it also holds a reference (indirectly) to every
subsequent item, which means the garbage collector can't collect any of them.

To get around this you can use `Clone`:
```c#
var start = Naturals.GetIterator();
for(var iter = start.Clone(); !iter.IsEmpty; iter = iter.Tail)
{
Console.WriteLine(iter.Head);
}
```
This creates a new 'head' for the sequence and so `iter` is the only reference, meaning updates to `iter`
make the head elements free for garbage collection.

So, `Iterator<A>` is much, much more powerful than `IEnumerator<A>`. It is mostly useful for immutable data-types
that need to carry an `IEnumerator<A>`, but can't due to its limitations. `Iterator<A>` has some limitations of its
own, but they are relatively easy to work around, whereas that isn't the case with `IEnumerator<A>` (without writing
a type like `Iterator<A>`!).
