Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ZeroMap::from_iter more efficient #2826

Open
Manishearth opened this issue Nov 10, 2022 · 28 comments
Open

Make ZeroMap::from_iter more efficient #2826

Manishearth opened this issue Nov 10, 2022 · 28 comments
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters discuss Discuss at a future ICU4X-SC meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band good first issue Good for newcomers help wanted Issue needs an assignee S-small Size: One afternoon (small bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt

Comments

@Manishearth
Copy link
Member

Using a bunch of insert()s is not super efficient since it can be quadratic with inefficient inserts.

We should perhaps construct a LiteMap first and then use ZeroMap::from_iter() or something. Ideally we have an efficient constructor from LiteMap that utilizes VarZeroVec's slice constructors.

Datagen perf isn't a huge deal so I consider this low priority, but it would be nice.

@Manishearth Manishearth added help wanted Issue needs an assignee C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) labels Nov 10, 2022
@sffc sffc added good first issue Good for newcomers T-techdebt Type: ICU4X code health and tech debt backlog discuss Discuss at a future ICU4X-SC meeting needs-approval One or more stakeholders need to approve proposal and removed discuss Discuss at a future ICU4X-SC meeting labels Dec 22, 2022
@sffc
Copy link
Member

sffc commented Dec 22, 2022

@Manishearth says: ideally we have an API that takes a vector and then sorts the vector; this is more efficient than FromIterator since that one only sees one element at a time. We could also use a BTreeMap. But more discussion is required.

CC @robertbastian @sffc

@Manishearth
Copy link
Member Author

The main problem here is VarZeroVec constructions.

VarZeroVec has an efficient method of construction given a sorted slice: it uses the length of the slice to preallcoate space, and then slots stuff in one by one. We can also add one that takes in a BTreeMap, or ExactSizeIterators in general.

However, from_iter() just calls append() in a sequence because it doesn't know the number of elements so it cannot preallocate space. This means a copy/shift operation each iteration.

A potential optimization on VarZeroVec is to use size_hint() to hit the ExactSizeIterator path. That would let us continue to use from_iter(), but we would have to use it with iterators that are known to be exact size, which is easy to lose by accident.

I think the most efficient route is:

  • Add a VZV constructor from ExactSizeIterators (and a BTreeMap From impl)
  • construct BTreeMaps in datagen

@sffc sffc added this to the Backlog milestone Dec 22, 2022
@sffc sffc removed the backlog label Dec 22, 2022
@robertbastian
Copy link
Member

I don't think we should add any special constructors. FromIterator will be as efficient as any constructor if you do something like btreemap.into_iter().collect() (and you check the size_hint in FromIterator, which you always should), and it's the rusty way to do this.

@Manishearth
Copy link
Member Author

The problem is that size_hint isn't guaranteed to be accurate and that code gets really gnarly to write because if size_hint is wrong we need to fall back to the old behavior using a partially initialized vector.

It's still doable: you're welcome to take a crack at it. Perhaps the behavior can be:

  • if no size_hint, just append
  • if there is a size hint, preallocate, and keep track of stuff written
  • if we end up not having enugh elements to write, manually reconstruct a new one
  • if we end up having more elements, just keep appending

@sffc
Copy link
Member

sffc commented Jan 20, 2023

I think I prefer fixing up FromIterator and the VZV mutation code to be more invariant to the exact structure of the VZV, which would fix this (we can use size_hint more easily) and also enables using alternative index stores like FZV or even AsciiTrie

@Manishearth
Copy link
Member Author

more invariant to the exact structure of the VZV,

that would definitely not fix this since efficient mutation requires maintaing partially initialized state very carefully. it's specific to this format.

also enables using alternative index stores like FZV or even AsciiTrie

I think this is completely orthogonal: eventually this will probably be driven through the VZVFormat trait so alternate index stores will still get the right FromIterator code.

I'd like to discuss the FromIterator code for the current format here, other formats will be able to do FromIterator their own way if they wish.

@robertbastian
Copy link
Member

I see the problem now. For optimal performance we have to iterate twice (by reference): once to calculate the length of each item, once to serialize the items (basically what get_serializable_bytes does at the moment). Currently our "bound" for an input like this is &[A]:

pub fn try_from_elements<A>(elements: &[A]) -> Result<Self, &'static str>
    where
        A: EncodeAsVarULE<T>

We can extend this to arbitrary repeatable sequences (e.g. BTreeMap::keys). It'd be something like

pub fn try_from_elements<I, A>(elements: I) -> Result<Self, &'static str>
    where
        A: EncodeAsVarULE<T>,
        I: Iterator<&'a A> + Clone

This is not doable with the FromIterator trait, because its input is a generic IntoIterator. I wish Rust had a trait for repeatable iterators, we're also doing the Clone thing in the list formatter to predict the output length.

@Manishearth
Copy link
Member Author

Yep. We have two options here:

  • Special case some specific types
  • Don't special case anything, but support size_hint in the complicated way I sketched out above

@robertbastian
Copy link
Member

I don't think my proposed solution falls under either of those?

@Manishearth
Copy link
Member Author

Oh, right

I don't think we should be iterating twice. If we're going to preallocate a vector we should just accept a slice, and if we're going to clone the iterator we should just use size_hint or ExactSizeIterator and use the second option I proposed.

@robertbastian
Copy link
Member

robertbastian commented Jan 20, 2023

Why not? Cloning something like std::collections::btree_map::Keys is very cheap, as it doesn't contain any values (well, keys).

@Manishearth
Copy link
Member Author

Yeah but there's no guarantee it will always have the same behavior so you still have to guard against that.

Also iterating twice may not be cheap

@robertbastian
Copy link
Member

Yeah but there's no guarantee it will always have the same behavior so you still have to guard against that.

Huh? A clone should give an object that behaves the same way.

Also iterating twice may not be cheap

🤷 There's a clone bound on the iterator, if it's not cheap to clone that's the client's problem.

@Manishearth
Copy link
Member Author

Huh? A clone should give an object that behaves the same way.

We cannot rely on this from a soundness standpoint, you can write an iterator that yields rand().

There's a clone bound on the iterator, if it's not cheap to clone that's the client's problem.

?????? Clone has no association with cheapness. That's Copy, and unfortunately iterators are never Copy since it's a footgun

vector.into_iter() is cloneable and expensive to clone

@robertbastian
Copy link
Member

?????? Clone has no association with cheapness.

I mean that it is documented that we will clone this, so if you want it to be efficient, it's your responsibility to make the clone efficient. If you don't care you can pass in a vector.into_iter().

@Manishearth
Copy link
Member Author

I mean that it is documented that we will clone this, so if you want it to be efficient, it's your responsibility to make the clone efficient. If you don't care you can pass in a vector.into_iter().

Not if we use From or something, it's not the norm to read From docs.

@sffc
Copy link
Member

sffc commented Jan 21, 2023

more invariant to the exact structure of the VZV,

that would definitely not fix this since efficient mutation requires maintaing partially initialized state very carefully. it's specific to this format.

also enables using alternative index stores like FZV or even AsciiTrie

I think this is completely orthogonal: eventually this will probably be driven through the VZVFormat trait so alternate index stores will still get the right FromIterator code.

I'd like to discuss the FromIterator code for the current format here, other formats will be able to do FromIterator their own way if they wish.

The idea is that you have your index structure and your values structure both backed by a &mut [u8]. You can pre-allocate however much space you think you need for each. When you need more space, things get shifted around to accommodate. We can do this without being specific to the VZV format but still being equally efficient as the current implementation.

@sffc
Copy link
Member

sffc commented Jan 21, 2023

Something kinda like

/// # Safety
///
/// 1. get_ordered_indices MUST return an ordered list of all indices
/// 2. get_range_for MUST return two adjacent indices
pub unsafe trait IndexStore<'a, K: ?Sized> {
  type Error;
  fn try_new(bytes: &'a [u8]) -> Result<Self, Self::Error>;
  /// Gets the range in the data slice for the given key
  fn get_range_for<Q: ?Sized>(key: Q) -> Range<usize> where K: Borrow<Q>;
  /// Used for verification of values during deserialization
  fn get_ordered_indices(&self) -> impl Iterator<usize> + '_;
}

pub trait MutableIndexStore<'a, K> : IndexStore<'a, K> {
  fn new(&'a mut [u8]) -> Self;
  fn insert(&mut self, key: K, length: usize) -> IndexStoreInsertResult;
  fn into_buffer(self) -> &mut [u8];
}

pub enum IndexStoreInsertResult {
  /// Insertion was successful; the value should be inserted into the data store at the given position
  Success(usize),
  /// There is not enough room in the store's backing buffer; please add this much more space
  BufferOverflow(usize),
  /// The insertion exceeds the limits of this index store; please use a different index store
  Limit,
}

pub struct VarZeroMap<'a, K, V: VarULE, I: IndexStore<'a, K>> {
  length: u32,
  indices: I,
  values: &'a [u8],
  _value_type: PhantomData<V>,
}

pub struct VarZeroMapSlice {
  length: u32::ULE,
  indices: [u8],
  values: [u8], // not currently possible to have two of these in the same struct
}

For example, the following types can implement IndexStore:

  • impl IndexStore<usize> for FlexZeroVec
  • impl IndexStore<usize> for ZeroSlice<u16>
  • impl IndexStore<str> for AsciiTrie
  • impl IndexStore<Foo> for ZeroMap<Foo, usize>

Then if you want to map from ASCII str to str, for example, you can use a VarZeroMap<str, str, AsciiTrie>, which is more efficient than ZeroMap<str, str>:

  • VarZeroMap<str, str, AsciiTrie> contains an AsciiTrie and the value strings data array. That's it.
  • ZeroMap<str, str> contains an index table for the key strings, then the key strings data array, then an index table for the value strings, then the value strings data array.

This addresses one of the pain points that often comes up with VarZeroVec which is that you want to skip the extra lookup in the index array if you already know the indices into the data array.

@Manishearth
Copy link
Member Author

I mean, yes, that's the long term plan we made ages ago: ZeroMap should be generic over an index store and value store (instead of doing the messy store-selection mess).

We still need better mutation code for the current VZV construction code, whether we stick a trait in the way or not.

I would rather not link these two problems, the index/value store stuff to me is a longer term thing we should do once we've got some experience with AsciiTrie and ZeroHashMap.

There are three layers of problems here. Each needs to be solved at some point, and solving a larger problem does not automatically solve the smaller problem:

  • standard-format VZV construction is slow but could be faster (and this is bad on its own but also impacts map constructions)
  • custom VZV formats aren't yet general enough to do fancy things, and when we get there we need to make construction work for them too
    • this doesn't automatically solve the previous problem (it just means you need to solve it in a generic fashion)
  • custom index stores for maps should exist
    • this does not automatically solve either of the two former problems, though it dovetails nicely with custom VZV formats

To me, each of these is a separate problem that should be tackled separately on its own timeline.

@sffc
Copy link
Member

sffc commented Jan 24, 2023

This is not ZeroMap; it's VarZeroMap. The indexes are offsets into the variable-length-data buffer. This requires a more sophisticated index store than what a ZeroMap needs. It's also more rigorous because we need to validate the contents of the data buffer upon deserialization.

We're going to have to re-write mutation code with index stores again, even if we re-write it now. I don't see how fixing the problem within the current (minimally-configurable index) framework sets us up for an easier time migrating to a fully-configurable index.

@Manishearth
Copy link
Member Author

Manishearth commented Jan 24, 2023

Three reasons:

  • completely ignoring maps, this is also just a problem for VarZeroVec in general that I'd like fixed (regardless of the title of this issue)
  • the fully configurable index will still need this problem somewhat solved for VarZeroVec if we want to use VarZeroVec as one of the configurable indices
  • The solution here will better inform that API in the future configurable-index solution will use

@sffc sffc added discuss Discuss at a future ICU4X-SC meeting and removed needs-approval One or more stakeholders need to approve proposal labels Jan 24, 2023
@robertbastian
Copy link
Member

If VZV mutation is so bad, why do we support it? I don't think we need it anywhere, and it seems like it complicates a lot of things. Isn't it enough to have VZV constructible from a slice?

@Manishearth
Copy link
Member Author

So people can make fixups if they want. Especially for maps.

It's not completely horrible, it's just not great. We wanted these types to be Cow-like as much as possible.

@robertbastian
Copy link
Member

#3098 reduces the problem to ZeroMap::from_iter's implementation. That currently does repeated insertion, as that's the only thing you can do on a MutableZeroVecLike. So we need to

  • implement more efficient construction of VZV from an iterator (or a concrete type like std::collections::btree_map::{Keys, Values}
  • Find a way to access this from ZeroMap::from_iter. This can be by extending the MutableZeroVecLike trait, or by replacing ZeroMap::from_iter by a more concrete function like ZeroMap::from_btree_map. We'd probably need four implementations depending on K/V::Container being ZV/VZV.

@robertbastian robertbastian changed the title Use more efficient ZeroMap constructions in datagen Make ZeroMap::from_iter more efficient Feb 10, 2023
@robertbastian
Copy link
Member

@pdogr fyi

@sffc
Copy link
Member

sffc commented Jun 15, 2023

Discuss with:

Optional:

@sffc sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label Jun 22, 2023
@Manishearth
Copy link
Member Author

impl<'a, A, B, K, V> FromIterator<(A, B)> for ZeroMap<'a, K, V>
where
    A: Borrow<K>,
    B: Borrow<V>,
    K: ZeroMapKV<'a> + ?Sized + Ord,
    V: ZeroMapKV<'a> + ?Sized,
{
    fn from_iter<T>(iter: T) -> Self
    where
        T: IntoIterator<Item = (A, B)>,
    {
        let iter = iter.into_iter();
        let mut map = match iter.size_hint() {
            (_, Some(upper)) => Self::with_capacity(upper),
            (lower, None) => Self::with_capacity(lower),
        };

        for (key, value) in iter {
            if let Some((key, value)) = map.try_append(key.borrow(), value.borrow()) {
                map.insert(key, value);
            }
        }
        map
    }
}

this is the problematic FromIterator impl.

It does allow out-of-order key insertion.

I think we need some way to allow MutableZVLikes to have an intermediate construction builder object (which for ZVs is just ZV). For VZV the object will happily construct a VZV piece-by-piece until you perform a non-appending insert at which point it will bail and switch over to a BTreeMap-based implementation.

In the long run, the fix is in the design for #3128

@Manishearth
Copy link
Member Author

It's unclear to me if there's discussion to be had here: I think this issue mostly needs implementation work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters discuss Discuss at a future ICU4X-SC meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band good first issue Good for newcomers help wanted Issue needs an assignee S-small Size: One afternoon (small bug fix or enhancement) T-techdebt Type: ICU4X code health and tech debt
Projects
None yet
Development

No branches or pull requests

4 participants