Make ZeroMap::from_iter more efficient #2826

Manishearth · 2022-11-10T23:19:57Z

Using a bunch of insert()s is not super efficient since it can be quadratic with inefficient inserts.

We should perhaps construct a LiteMap first and then use ZeroMap::from_iter() or something. Ideally we have an efficient constructor from LiteMap that utilizes VarZeroVec's slice constructors.

Datagen perf isn't a huge deal so I consider this low priority, but it would be nice.

The text was updated successfully, but these errors were encountered:

sffc · 2022-12-22T17:40:52Z

@Manishearth says: ideally we have an API that takes a vector and then sorts the vector; this is more efficient than FromIterator since that one only sees one element at a time. We could also use a BTreeMap. But more discussion is required.

CC @robertbastian @sffc

Manishearth · 2022-12-22T17:49:53Z

The main problem here is VarZeroVec constructions.

VarZeroVec has an efficient method of construction given a sorted slice: it uses the length of the slice to preallcoate space, and then slots stuff in one by one. We can also add one that takes in a BTreeMap, or ExactSizeIterators in general.

However, from_iter() just calls append() in a sequence because it doesn't know the number of elements so it cannot preallocate space. This means a copy/shift operation each iteration.

A potential optimization on VarZeroVec is to use size_hint() to hit the ExactSizeIterator path. That would let us continue to use from_iter(), but we would have to use it with iterators that are known to be exact size, which is easy to lose by accident.

I think the most efficient route is:

Add a VZV constructor from ExactSizeIterators (and a BTreeMap From impl)
construct BTreeMaps in datagen

robertbastian · 2023-01-19T18:56:42Z

I don't think we should add any special constructors. FromIterator will be as efficient as any constructor if you do something like btreemap.into_iter().collect() (and you check the size_hint in FromIterator, which you always should), and it's the rusty way to do this.

Manishearth · 2023-01-19T19:12:31Z

The problem is that size_hint isn't guaranteed to be accurate and that code gets really gnarly to write because if size_hint is wrong we need to fall back to the old behavior using a partially initialized vector.

It's still doable: you're welcome to take a crack at it. Perhaps the behavior can be:

if no size_hint, just append
if there is a size hint, preallocate, and keep track of stuff written
if we end up not having enugh elements to write, manually reconstruct a new one
if we end up having more elements, just keep appending

sffc · 2023-01-20T00:02:43Z

I think I prefer fixing up FromIterator and the VZV mutation code to be more invariant to the exact structure of the VZV, which would fix this (we can use size_hint more easily) and also enables using alternative index stores like FZV or even AsciiTrie

Manishearth · 2023-01-20T00:21:15Z

more invariant to the exact structure of the VZV,

that would definitely not fix this since efficient mutation requires maintaing partially initialized state very carefully. it's specific to this format.

also enables using alternative index stores like FZV or even AsciiTrie

I think this is completely orthogonal: eventually this will probably be driven through the VZVFormat trait so alternate index stores will still get the right FromIterator code.

I'd like to discuss the FromIterator code for the current format here, other formats will be able to do FromIterator their own way if they wish.

robertbastian · 2023-01-20T15:00:25Z

I see the problem now. For optimal performance we have to iterate twice (by reference): once to calculate the length of each item, once to serialize the items (basically what get_serializable_bytes does at the moment). Currently our "bound" for an input like this is &[A]:

pub fn try_from_elements<A>(elements: &[A]) -> Result<Self, &'static str>
    where
        A: EncodeAsVarULE<T>

We can extend this to arbitrary repeatable sequences (e.g. BTreeMap::keys). It'd be something like

pub fn try_from_elements<I, A>(elements: I) -> Result<Self, &'static str>
    where
        A: EncodeAsVarULE<T>,
        I: Iterator<&'a A> + Clone

This is not doable with the FromIterator trait, because its input is a generic IntoIterator. I wish Rust had a trait for repeatable iterators, we're also doing the Clone thing in the list formatter to predict the output length.

Manishearth · 2023-01-20T17:47:44Z

Yep. We have two options here:

Special case some specific types
Don't special case anything, but support size_hint in the complicated way I sketched out above

robertbastian · 2023-01-20T20:17:49Z

I don't think my proposed solution falls under either of those?

Manishearth · 2023-01-20T20:20:41Z

Oh, right

I don't think we should be iterating twice. If we're going to preallocate a vector we should just accept a slice, and if we're going to clone the iterator we should just use size_hint or ExactSizeIterator and use the second option I proposed.

robertbastian · 2023-01-20T20:30:51Z

Why not? Cloning something like std::collections::btree_map::Keys is very cheap, as it doesn't contain any values (well, keys).

Manishearth · 2023-01-20T20:33:44Z

Yeah but there's no guarantee it will always have the same behavior so you still have to guard against that.

Also iterating twice may not be cheap

robertbastian · 2023-01-20T20:43:13Z

Yeah but there's no guarantee it will always have the same behavior so you still have to guard against that.

Huh? A clone should give an object that behaves the same way.

Also iterating twice may not be cheap

🤷 There's a clone bound on the iterator, if it's not cheap to clone that's the client's problem.

Manishearth · 2023-01-20T20:45:18Z

Huh? A clone should give an object that behaves the same way.

We cannot rely on this from a soundness standpoint, you can write an iterator that yields rand().

There's a clone bound on the iterator, if it's not cheap to clone that's the client's problem.

?????? Clone has no association with cheapness. That's Copy, and unfortunately iterators are never Copy since it's a footgun

vector.into_iter() is cloneable and expensive to clone

robertbastian · 2023-01-20T20:56:19Z

?????? Clone has no association with cheapness.

I mean that it is documented that we will clone this, so if you want it to be efficient, it's your responsibility to make the clone efficient. If you don't care you can pass in a vector.into_iter().

Manishearth · 2023-01-20T21:43:23Z

I mean that it is documented that we will clone this, so if you want it to be efficient, it's your responsibility to make the clone efficient. If you don't care you can pass in a vector.into_iter().

Not if we use From or something, it's not the norm to read From docs.

sffc · 2023-01-21T02:40:39Z

more invariant to the exact structure of the VZV,

that would definitely not fix this since efficient mutation requires maintaing partially initialized state very carefully. it's specific to this format.

also enables using alternative index stores like FZV or even AsciiTrie

I think this is completely orthogonal: eventually this will probably be driven through the VZVFormat trait so alternate index stores will still get the right FromIterator code.

I'd like to discuss the FromIterator code for the current format here, other formats will be able to do FromIterator their own way if they wish.

The idea is that you have your index structure and your values structure both backed by a &mut [u8]. You can pre-allocate however much space you think you need for each. When you need more space, things get shifted around to accommodate. We can do this without being specific to the VZV format but still being equally efficient as the current implementation.

sffc · 2023-01-21T23:17:51Z

Something kinda like

/// # Safety
///
/// 1. get_ordered_indices MUST return an ordered list of all indices
/// 2. get_range_for MUST return two adjacent indices
pub unsafe trait IndexStore<'a, K: ?Sized> {
  type Error;
  fn try_new(bytes: &'a [u8]) -> Result<Self, Self::Error>;
  /// Gets the range in the data slice for the given key
  fn get_range_for<Q: ?Sized>(key: Q) -> Range<usize> where K: Borrow<Q>;
  /// Used for verification of values during deserialization
  fn get_ordered_indices(&self) -> impl Iterator<usize> + '_;
}

pub trait MutableIndexStore<'a, K> : IndexStore<'a, K> {
  fn new(&'a mut [u8]) -> Self;
  fn insert(&mut self, key: K, length: usize) -> IndexStoreInsertResult;
  fn into_buffer(self) -> &mut [u8];
}

pub enum IndexStoreInsertResult {
  /// Insertion was successful; the value should be inserted into the data store at the given position
  Success(usize),
  /// There is not enough room in the store's backing buffer; please add this much more space
  BufferOverflow(usize),
  /// The insertion exceeds the limits of this index store; please use a different index store
  Limit,
}

pub struct VarZeroMap<'a, K, V: VarULE, I: IndexStore<'a, K>> {
  length: u32,
  indices: I,
  values: &'a [u8],
  _value_type: PhantomData<V>,
}

pub struct VarZeroMapSlice {
  length: u32::ULE,
  indices: [u8],
  values: [u8], // not currently possible to have two of these in the same struct
}

For example, the following types can implement IndexStore:

impl IndexStore<usize> for FlexZeroVec
impl IndexStore<usize> for ZeroSlice<u16>
impl IndexStore<str> for AsciiTrie
impl IndexStore<Foo> for ZeroMap<Foo, usize>

Then if you want to map from ASCII str to str, for example, you can use a VarZeroMap<str, str, AsciiTrie>, which is more efficient than ZeroMap<str, str>:

VarZeroMap<str, str, AsciiTrie> contains an AsciiTrie and the value strings data array. That's it.
ZeroMap<str, str> contains an index table for the key strings, then the key strings data array, then an index table for the value strings, then the value strings data array.

This addresses one of the pain points that often comes up with VarZeroVec which is that you want to skip the extra lookup in the index array if you already know the indices into the data array.

Manishearth · 2023-01-24T00:58:25Z

I mean, yes, that's the long term plan we made ages ago: ZeroMap should be generic over an index store and value store (instead of doing the messy store-selection mess).

We still need better mutation code for the current VZV construction code, whether we stick a trait in the way or not.

I would rather not link these two problems, the index/value store stuff to me is a longer term thing we should do once we've got some experience with AsciiTrie and ZeroHashMap.

There are three layers of problems here. Each needs to be solved at some point, and solving a larger problem does not automatically solve the smaller problem:

standard-format VZV construction is slow but could be faster (and this is bad on its own but also impacts map constructions)
custom VZV formats aren't yet general enough to do fancy things, and when we get there we need to make construction work for them too
- this doesn't automatically solve the previous problem (it just means you need to solve it in a generic fashion)
custom index stores for maps should exist
- this does not automatically solve either of the two former problems, though it dovetails nicely with custom VZV formats

To me, each of these is a separate problem that should be tackled separately on its own timeline.

sffc · 2023-01-24T01:05:54Z

This is not ZeroMap; it's VarZeroMap. The indexes are offsets into the variable-length-data buffer. This requires a more sophisticated index store than what a ZeroMap needs. It's also more rigorous because we need to validate the contents of the data buffer upon deserialization.

We're going to have to re-write mutation code with index stores again, even if we re-write it now. I don't see how fixing the problem within the current (minimally-configurable index) framework sets us up for an easier time migrating to a fully-configurable index.

Manishearth · 2023-01-24T01:09:30Z

Three reasons:

completely ignoring maps, this is also just a problem for VarZeroVec in general that I'd like fixed (regardless of the title of this issue)
the fully configurable index will still need this problem somewhat solved for VarZeroVec if we want to use VarZeroVec as one of the configurable indices
The solution here will better inform that API in the future configurable-index solution will use

robertbastian · 2023-01-24T08:00:38Z

If VZV mutation is so bad, why do we support it? I don't think we need it anywhere, and it seems like it complicates a lot of things. Isn't it enough to have VZV constructible from a slice?

Manishearth · 2023-01-24T16:07:24Z

So people can make fixups if they want. Especially for maps.

It's not completely horrible, it's just not great. We wanted these types to be Cow-like as much as possible.

robertbastian · 2023-02-09T16:13:11Z

#3098 reduces the problem to ZeroMap::from_iter's implementation. That currently does repeated insertion, as that's the only thing you can do on a MutableZeroVecLike. So we need to

implement more efficient construction of VZV from an iterator (or a concrete type like std::collections::btree_map::{Keys, Values}
Find a way to access this from ZeroMap::from_iter. This can be by extending the MutableZeroVecLike trait, or by replacing ZeroMap::from_iter by a more concrete function like ZeroMap::from_btree_map. We'd probably need four implementations depending on K/V::Container being ZV/VZV.

robertbastian · 2023-02-16T09:32:16Z

@pdogr fyi

sffc · 2023-06-15T18:13:03Z

Discuss with:

Optional:

Manishearth · 2023-07-03T15:38:17Z

impl<'a, A, B, K, V> FromIterator<(A, B)> for ZeroMap<'a, K, V>
where
    A: Borrow<K>,
    B: Borrow<V>,
    K: ZeroMapKV<'a> + ?Sized + Ord,
    V: ZeroMapKV<'a> + ?Sized,
{
    fn from_iter<T>(iter: T) -> Self
    where
        T: IntoIterator<Item = (A, B)>,
    {
        let iter = iter.into_iter();
        let mut map = match iter.size_hint() {
            (_, Some(upper)) => Self::with_capacity(upper),
            (lower, None) => Self::with_capacity(lower),
        };

        for (key, value) in iter {
            if let Some((key, value)) = map.try_append(key.borrow(), value.borrow()) {
                map.insert(key, value);
            }
        }
        map
    }
}

this is the problematic FromIterator impl.

It does allow out-of-order key insertion.

I think we need some way to allow MutableZVLikes to have an intermediate construction builder object (which for ZVs is just ZV). For VZV the object will happily construct a VZV piece-by-piece until you perform a non-appending insert at which point it will bail and switch over to a BTreeMap-based implementation.

In the long run, the fix is in the design for #3128

Manishearth · 2024-02-28T17:23:24Z

It's unclear to me if there's discussion to be had here: I think this issue mostly needs implementation work.

Manishearth added help wanted Issue needs an assignee C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) labels Nov 10, 2022

Manishearth mentioned this issue Nov 10, 2022

Transformer for RelativeTimeFormat component #2822

Merged

sffc added good first issue Good for newcomers T-techdebt Type: ICU4X code health and tech debt backlog discuss Discuss at a future ICU4X-SC meeting needs-approval One or more stakeholders need to approve proposal and removed discuss Discuss at a future ICU4X-SC meeting labels Dec 22, 2022

sffc added this to the Backlog milestone Dec 22, 2022

sffc removed the backlog label Dec 22, 2022

sffc added discuss Discuss at a future ICU4X-SC meeting and removed needs-approval One or more stakeholders need to approve proposal labels Jan 24, 2023

robertbastian mentioned this issue Feb 9, 2023

Less ZeroMap mutation in datagen #3098

Merged

robertbastian changed the title ~~Use more efficient ZeroMap constructions in datagen~~ Make ZeroMap::from_iter more efficient Feb 10, 2023

robertbastian mentioned this issue May 15, 2023

Add LiteMap to ZeroMap[2d] conversion functions #1331

Open

1 task

sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label Jun 22, 2023

robertbastian assigned sffc, younies and Manishearth Feb 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make ZeroMap::from_iter more efficient #2826

Make ZeroMap::from_iter more efficient #2826

Manishearth commented Nov 10, 2022

sffc commented Dec 22, 2022

Manishearth commented Dec 22, 2022

robertbastian commented Jan 19, 2023

Manishearth commented Jan 19, 2023

sffc commented Jan 20, 2023

Manishearth commented Jan 20, 2023

robertbastian commented Jan 20, 2023

Manishearth commented Jan 20, 2023

robertbastian commented Jan 20, 2023

Manishearth commented Jan 20, 2023

robertbastian commented Jan 20, 2023 •

edited

Loading

Manishearth commented Jan 20, 2023

robertbastian commented Jan 20, 2023

Manishearth commented Jan 20, 2023

robertbastian commented Jan 20, 2023

Manishearth commented Jan 20, 2023

sffc commented Jan 21, 2023

sffc commented Jan 21, 2023 •

edited

Loading

Manishearth commented Jan 24, 2023

sffc commented Jan 24, 2023

Manishearth commented Jan 24, 2023 •

edited

Loading

robertbastian commented Jan 24, 2023

Manishearth commented Jan 24, 2023

robertbastian commented Feb 9, 2023

robertbastian commented Feb 16, 2023

sffc commented Jun 15, 2023 •

edited

Loading

Manishearth commented Jul 3, 2023

Manishearth commented Feb 28, 2024

Make ZeroMap::from_iter more efficient #2826

Make ZeroMap::from_iter more efficient #2826

Comments

Manishearth commented Nov 10, 2022

sffc commented Dec 22, 2022

Manishearth commented Dec 22, 2022

robertbastian commented Jan 19, 2023

Manishearth commented Jan 19, 2023

sffc commented Jan 20, 2023

Manishearth commented Jan 20, 2023

robertbastian commented Jan 20, 2023

Manishearth commented Jan 20, 2023

robertbastian commented Jan 20, 2023

Manishearth commented Jan 20, 2023

robertbastian commented Jan 20, 2023 • edited Loading

Manishearth commented Jan 20, 2023

robertbastian commented Jan 20, 2023

Manishearth commented Jan 20, 2023

robertbastian commented Jan 20, 2023

Manishearth commented Jan 20, 2023

sffc commented Jan 21, 2023

sffc commented Jan 21, 2023 • edited Loading

Manishearth commented Jan 24, 2023

sffc commented Jan 24, 2023

Manishearth commented Jan 24, 2023 • edited Loading

robertbastian commented Jan 24, 2023

Manishearth commented Jan 24, 2023

robertbastian commented Feb 9, 2023

robertbastian commented Feb 16, 2023

sffc commented Jun 15, 2023 • edited Loading

Manishearth commented Jul 3, 2023

Manishearth commented Feb 28, 2024

robertbastian commented Jan 20, 2023 •

edited

Loading

sffc commented Jan 21, 2023 •

edited

Loading

Manishearth commented Jan 24, 2023 •

edited

Loading

sffc commented Jun 15, 2023 •

edited

Loading