Skip to content

Add an Iterator::batch transformer and explicit size hints for FromIterator and Extend #14271

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

erickt
Copy link
Contributor

@erickt erickt commented May 18, 2014

This pull request is focused on adding a Iterator::batch transformer. batch allows you to batch up multiple values from a stream into one value. For example, this can be used for transforming a stream of a monotonically increasing infinite iterator like count(0, 1) into a stream of vec!(0, 1, 2), vec!(3, 4, 5), and etc. It can also be seen as a low level transformer primitive that could be used to implement a variety of iterator transformers, as you could use it to implement map, filter, reduce, and more.

Along the way this also adds support for passing an explicit size hints to .collect(), .from_iter() and .extend() in the form of .collect_with_capacity(), .from_iter_with_capacity() and .extend_with_capacity(). Iterator size hints and .collect() calls are generally conservative in the size of collections they preallocate. For example, a filter call like let xs: Vec<int> = range(1, 5).filter(|x| x != 3).collect() would initially allocate a zero-sized Vec<T> because the .filter() gives a lower bound of 0, and collect always uses the lower bounds. There is no way for the user to inform .collect() to preallocate space for 4 items. Similarly, there isn't a way to overallocate a collection with .collect(). These new methods allow the user to have better control over the memory allocation while still using generic methods.

I have to admit though, I'm not sure if it's better to create methods like .collect_with_capacity(), or if we should rename .size_hint() to .size_bounds(), provide a new Iterator.size_hint() that may be outside the bounds of the values, and a .sized_by() transformer that lets a user to provide a different hint than the lower bounds.

Finally, it also adds some .reserve_additional methods on some collections, and cleans up some dead code.

Some(it.by_ref().take(3).collect::<Vec<uint>>())
}).take(3);

assert!(it.next() == Some(vec!(0, 1, 2)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert_eq?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't get assert_eq!() to work in libcore because I can't get it to find the Show impl for Vec. I feel like I should be able to import Show, but for some reason that doesn't work,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's because Vec comes from realstd, which means it's implementing a different Show than the one compiled in this libcore.

@lilyball
Copy link
Contributor

cc me

@lilyball
Copy link
Contributor

from_iter_with_capacity() is a bit unnecessary. We can just change from_iter() to take capacity and change collect() to pass the lower bound. Nobody calls FromIterator directly anyway, so there's no need to provide a convenience method from within that trait.

pub trait FromIterator<A> {
    /// Build a container with elements from an external iterator.
    ///
    /// `cap_hint` is a hint as to what the capacity of the collection should be.
    /// It is usually, but not always, the lower size bound of the iterator.
    fn from_iter<T: Iterator<A>>(iterator: T, cap_hint: uint) -> Self;
}

pub trait Iterator<A> {
    // ...
    fn collect<B: FromIterator<A>>(&mut self) -> B {
        let (lower, _) = self.size_hint();
        FromIterator::from_iter(self.by_ref(), lower)
    }
}

Similarly we don't need collect_with_capacity(). Actual uses of this functionality are likely to be rare enough that anyone who does need it can just use FromIterator::from_iter() directly.

@lilyball
Copy link
Contributor

Aside from my previous comment, I'm generally in favor of this patch. I'm not thrilled about making the core method of Extendable into extend_with_capacity(), but there's no obvious alternative solution (beyond leaving this capability out of Extendable at all).

As for the suggested alternative to this patch, which is to change size_hint() into size_bound() and add a separate size_hint(), I don't think that will work. Notably, the wrapper Iterator that is returned by .sized_by() will not know how to appropriately adjust size_hint() as the iterator is consumed.

@erickt
Copy link
Contributor Author

erickt commented May 18, 2014

@kballard: I'm fine merging from_iter and from_iter_with_capacity. Regarding .collect_with_capacity(), I guess we could leave it for a possible future PR, but it certainly is nicer to be able to just call let x: Vec<int> = iter.collect_with_capacity(10) then to do:

use std::FromIterator;
...
let x: Vec<int> = FromIterator::from_iter(iter, 10);

It does have some nice symmetry with a .extend_with_capacity() method too.

Either way, I found maybe 10-20 spots in the stdlib we could use this _with_capacity feature, so you're right it's not common, but we could use it to simplify down some code.

@lilyball
Copy link
Contributor

@erickt I'm largely just concerned about API bloat. Iterator already has 30 default methods.

@alexcrichton
Copy link
Member

I would be more in favor of adding an iterator adaptor which modifies the return value of size_hint than extra methods for collect and friends. Something like:

[1, 2, 3, 4].iter().filter(|&i| i %2 == 0).size_hint()  // => (0, Some(4))
[1, 2, 3, 4].iter().filter(|&i| i %2 == 0).optimistic().size_hint()  // => (4, Some(4))

@lilyball
Copy link
Contributor

@alexcrichton size_hint() is defined as a lower and upper bound. Every single Iterator in the standard libraries (that provides a size_hint()) conforms to this, and guarantees that it will vend at least as many elements as the lower bound of its size_hint(), and no more elements than its upper bound (if it has one). What you're proposing is a violation of this guarantee.

The only way to make this work is to redefine size_hint() so it's no longer bounds, or to provide an alternate value separate from the bounds that can be overridden.

A problem with doing this alternative hint is defining how it changes as the iterator is used up. The current size_hint() bounds provide their guarantee of correctness throughout the entire lifetime of the iterator. But if I create an iterator and override its alternative hint to be some specific number, how does the iterator know how that number should change as it's processed?

This is why I'm in favor of providing a way to override the capacity in from_iter(), as opposed to trying to define this estimate as a property of the iterator itself. I can calculate a good estimate when I first create my iterator, based on the knowledge I have about it (e.g. I know my filter() will only drop roughly 50% of the elements). But the iterator doesn't know how this estimate should change as it's consumed, nor do I know how to do that myself if I were given control over it.

@alexcrichton
Copy link
Member

Closing due to inactivity, but feel free to reopen with a rebase!

lnicola pushed a commit to lnicola/rust that referenced this pull request Mar 13, 2023
Don't trigger unresolved method/field diagnostics on types containing errors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants