Combine, an iterator adapter which statefully maps multiple input iterations to a single output iteration #379
Description
Proposal
Problem statement
A common iteration pattern I find myself wanting an adapter for is taking several iterations from a wrapped iterator which accumulate in state shared across those iterations to produce a single element from the wrapping iterator.
Motivating examples or use cases
One of the simplest usage examples for this might be converting an iterator of characters that contains null terminated strings into an iterator which just yields the strings, which is similar to the most recent use case which motivated this: parsing the names of functions in the __llvm_prf_names section in llvm profile instrumented binaries.
This is also useful for implementing finite state machines or stack automata which yield values upon reaching certain states and then continue parsing input.
Solution sketch
This API is an addition to std::Iterator. On any iterator, calling '.combine' will wrap it in the iterator below.
This iterator takes in an arbitrary State and a FnMut
taking a mutable reference to the state and an element from the iterator being wrapped and returning Option<B>
where B
is yielded from the iterator. This is the same lambda prototype that Scan
uses, though Scan
and Combine
are for largely unrelated use cases. The behavioral difference is that unlike Scan
, a return value of None doesn't stop iteration. Instead, it continues consuming values from the iterator it wraps until a call to F
yields Some
.
The idea being that many input iterations are combined to yield a single value.
into_inner yields the internal state of the iterator, which may be useful in case reaching the end of iteration might be a valid time to yield a value. If this is needed as part of the iterator, calling .finish will create a CombineFinish
iterator that handles this case using a FnOnce
that is called upon the Combine
iterator yielding None
.
Here is my napkin implementation for this, which was sufficient to try it out in a motivating use case (not shown is the trait I used to stick .combine
onto all implementations of Iterator. This would just be an additional method for the Iterator trait itself in practice)
pub struct Combine<I, St, F> {
iter: I,
f: F,
state: St,
}
impl<B, I, St, F> Combine<I, St, F>
where
I: Iterator,
F: FnMut(&mut St, I::Item) -> Option<B>
{
pub fn new(iter: I, state: St, f: F) -> Combine<I, St, F> {
Combine { iter, state, f }
}
pub fn into_inner(self) -> St {
self.state
}
pub fn finish<G: FnOnce(St) -> Option<B>>(self, g: G) -> CombineFinish<I, St, F, G> {
CombineFinish::new(self, g)
}
}
impl<B, I, St, F> Iterator for Combine<I, St, F>
where
I: Iterator,
F: FnMut(&mut St, I::Item) -> Option<B>
{
type Item = B;
fn next(&mut self) -> Option<B> {
while let Some(item) = self.iter.next() {
if let Some(output) = (self.f)(&mut self.state, item) {
return Some(output)
}
}
None
}
}
pub struct CombineFinish<I, St, F, G> {
inner: Option<(Combine<I, St, F>, G)>,
}
impl<B, I, St, F, G> CombineFinish<I, St, F, G>
where
I: Iterator,
G: FnOnce(St) -> Option<B>
{
pub fn new(iter: Combine<I, St, F>, g: G) -> CombineFinish<I, St, F, G> {
Self { inner: Some((iter, g)) }
}
}
impl<B, I, St, F, G> Iterator for CombineFinish<I, St, F, G>
where
I: Iterator,
F: FnMut(&mut St, I::Item) -> Option<B>,
G: FnOnce(St) -> Option<B>
{
type Item = B;
fn next(&mut self) -> Option<B> {
match self.inner.as_mut()?.0.next() {
Some(output) => Some(output),
None => {
let (iter, g) = self.inner.take()?;
g(iter.into_inner())
}
}
}
}
A usage example (parsing function names from a decompressed __llvm_prf_names block):
let fn_names = outbuf.drain(..).combine(Vec::new(), |v, c| {
match c {
0x01 => return Some(std::mem::replace(v, Vec::new())),
b':' => v.clear(),
x => v.push(x)
}
None
}).finish(|v| Some(v)).collect::<Vec<_>>();
Function names are terminated with 0x01, and the last function name in the block is not terminated so it is yielded when there are no more bytes left in the buffer. Some function names may be prepended with their containing filename followed by a colon, and in my usage I was uninterested in keeping that filename.
Alternatives
Indeed there are a number of ways to achieve this with existing APIs, though it would either require significant additional boilerplate such as making a custom iterator wrapper. Some creative use of filter_map
might get you halfway there, but it would require declaring the internal state on a separate line and passing a reference to it in a closure and the name filter_map
doesn't quite communicate what you're really trying to do. There's another trouble as well- what if the end of the input iterator represents a valid state to yield a value from? Therein lies the purpose of the optional CombineFinish extension.
This is a concise and inline way to express simple state machines which convert a series of tokens into a series of larger data yielded by combining those tokens, which occurs quite commonly in practice. This is a tool I I would have reached for several times in the past if it were available.
I will admit I had a hard time deciding on a satisfying name for this, so naming suggestions are more than welcome if you disagree that 'combine' or 'finish' get the point across well.
Links and related work
Similar to Java Stream's Gatherer: Java SE 22 Gatherer
Gatherer enables Many-to-One, One-to-One, and One-to-Many stream transformations. It likewise carries internal state and has .finisher()
, which is run after the input stream is ended.