Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serokell: [Milestone-1] New PersistentOrderedMap.mo lib #654

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

GoPavel
Copy link

@GoPavel GoPavel commented Sep 5, 2024

This is an MR for the 1st Milestone of Serokell's grant about improving Motoko's base library.

Within it, we are introducing a new functional implementation of the map data structure based on the current RBTree.mo.

Content of the MR:

  • extract the implementation from the RBTree.mo
  • change API to keep it convenient to store the map in the stable variable
  • add missing functionality
    • map
    • foldLeft, foldRight
    • mapFilter
    • keys, vals
  • optimize tree representation via elimination of option type ?V -> V
  • unit tests
  • property-based tests
  • documentation
    • check typos again
    • add examples

@sa-github-api
Copy link

Dear @GoPavel,

In order to potentially merge your code in this open-source repository and therefore proceed with your contribution, we need to have your approval on DFINITY's CLA.

If you decide to agree with it, please visit this issue and read the instructions there. Once you have signed it, re-trigger the workflow on this PR to see if your code can be merged.

— The DFINITY Foundation

@GoPavel GoPavel changed the title [Draft] Serokell: New PersistentMap.mo lib [Draft] Serokell: New PersistentOrderedMap.mo lib Sep 30, 2024
@GoPavel GoPavel changed the title [Draft] Serokell: New PersistentOrderedMap.mo lib Serokell: New PersistentOrderedMap.mo lib Oct 2, 2024
src/PersistentOrderedMap.mo Outdated Show resolved Hide resolved
Copy link
Contributor

@crusso crusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some interim comments (more to come, just submitting because I expect GH might lost my review soon)

src/PersistentOrderedMap.mo Outdated Show resolved Hide resolved
src/PersistentOrderedMap.mo Outdated Show resolved Hide resolved
/// with `Map` to maintain invariant that comparator did not changed.
///
/// `MapOps` contains methods that require `compare` internally:
/// operations that may reshape a `Map` or should find something.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// operations that may reshape a `Map` or should find something.
/// operations that update, traverse or search a map.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think traverse doesn't require comp (since we keep tree structure) so it's outside of the MapOps.
So I think it should be

Suggested change
/// operations that may reshape a `Map` or should find something.
/// operations that update or search a map.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that's a UX question - should we put all operations in MapOps, even if they don't require the comparison, just for uniformity? I guess we could also do that later easily enough. Something to think about.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we put all operations in MapOps

I also thought about this. We were going to measure its overhead on a benchmark and then decide. If there is no difference I would also put it there.

However for now some modules like RBTree or TrieMap provide some functionality as a module-function, not as a member-function e.g. map or iter. So we thought that the current solution is not very inconsistent.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did performance experiments:

  1. Comparing inlined internal functions and the current version (all calls through MapOps): [DMS-68] Inline internal functions into MapOps object serokell/motoko-base#21
  2. Comparing calling methods through MapOps and through the Internal module passing comp manually: [DMS-68] Compare function calls from MapOps with direct function call serokell/motoko-base#20

The 1st experiment shows that the inlined version has more or less the same performance (<1% difference).
The 2nd one shows that a static function call is faster than a similar one from MapOps on 1-4% percent.

Thus we can say that static calls are less convenient but faster, so probably it makes sense to keep the option ?
We can add the rest of the functions into MapOps and make the Internal module public.
OR keep the current version as compromised

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I guess we decide to put all operations in MapOps and leave Internal private for now at a 1-4% perf hit.
Does the perf hit disappear entirely if we inline the Internal functions in MapOps, or am I misunderstanding?

src/PersistentOrderedMap.mo Outdated Show resolved Hide resolved
Copy link
Contributor

@crusso crusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some more comments...

src/PersistentOrderedMap.mo Show resolved Hide resolved
src/PersistentOrderedMap.mo Outdated Show resolved Hide resolved
src/PersistentOrderedMap.mo Outdated Show resolved Hide resolved
@GoPavel GoPavel changed the title Serokell: New PersistentOrderedMap.mo lib Serokell: [Milestone-1] New PersistentOrderedMap.mo lib Oct 14, 2024
s-and-witch and others added 10 commits October 21, 2024 17:31
No changes in logic so far, just simple refactoring
Add `MapOps` class with the following signature:

  public class MapOps<K>(compare : (K,K) -> O.Order) {

    public func put<V>(rbMap : Map<K, V>, key : K, value : V) : Map<K, V>

    public func fromIter<V>(i : I.Iter<(K,V)>) : Map<K, V>

    public func replace<V>(rbMap : Map<K, V>, key : K, value : V) : (Map<K,V>, ?V)

    public func mapFilter<V1, V2>(f : (K, V1) -> ?V2, rbMap : Map<K, V1>) : Map<K, V2>

    public func get<V>(key : K, rbMap : Map<K, V>) : ?V

    public func delete<V>(rbMap : Map<K, V>, key : K) : Map<K, V>

    public func remove< V>(rbMap : Map<K, V>, key : K) : (Map<K,V>, ?V)

  };

The other functionality provided as standalone functions, as they
don't require comparator:

  public type Direction = { #fwd; #bwd };

  public func iter<K, V>(rbMap : Map<K, V>, direction : Direction) : I.Iter<(K, V)>

  public func entries<K, V>(m : Map<K, V>) : I.Iter<(K, V)>

  public func keys<K, V>(m : Map<K, V>, direction : Direction) : I.Iter<K>

  public func vals<K, V>(m : Map<K, V>, direction : Direction) : I.Iter<V>

  public func map<K, V1, V2>(f : (K, V1) -> V2, rbMap : Map<K, V1>) : Map<K, V2>

  public func size<K, V>(t : Map<K, V>) : Nat

  public func foldLeft<Key, Value, Accum>(
    combine : (Key, Value, Accum) -> Accum,
    base : Accum,
    rbMap : Map<Key, Value>
  ) : Accum

  And foldRight with the same signature as foldLeft

The following functions are new for the API:
- MapOps.put, MapOps.delete
- MapOps.fromIter, entries, keys, vals
- MapOps.mapFilter, map
- foldLeft, foldRight
Problem: now order is not consistent within new module and with old
modules as well.

Solution: make the map argument always go first
In addition to tests this patch removes `direction`
argument from `keys` and `values` function to keep
them simple and provides a new function `Map.empty`
to create a map without knowing its internal representation.
* rename `rbMap` into `m` in signature for brevity & consistent language
* rename `rbMap` into `map` in examples for brevity & encapsulation sake
* rename `tree` into `map` in doc comments for the encapsulation sake
@GoPavel
Copy link
Author

GoPavel commented Oct 21, 2024

@crusso
I've rebased the branch and fixed the name rbMap in doc comments and signatures.
Please resolve conversations that you think should be resolved, so we will see if something left about his PR.
Also, I would ask you to approve the CI workflow since now I see "This workflow is awaiting approval from a maintainer in" here

Copy link
Contributor

@crusso crusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments

/// with `Map` to maintain invariant that comparator did not changed.
///
/// `MapOps` contains methods that require `compare` internally:
/// operations that may reshape a `Map` or should find something.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I guess we decide to put all operations in MapOps and leave Internal private for now at a 1-4% perf hit.
Does the perf hit disappear entirely if we inline the Internal functions in MapOps, or am I misunderstanding?

test/PersistentOrderedMap.prop.test.mo Outdated Show resolved Hide resolved
@GoPavel
Copy link
Author

GoPavel commented Oct 24, 2024

So I guess we decide to put all operations in MapOps and leave Internal private for now at a 1-4% perf hit.
Does the perf hit disappear entirely if we inline the Internal functions in MapOps, or am I misunderstanding?

We have tried it, and it doesn't affect the benchmark at all. See serokell#21

crusso added a commit that referenced this pull request Nov 13, 2024
This is an MR for the 3rd Milestone of the Serokell's grant about
improving Motoko's base library.

The main goal of the PR is to introduce a new functional implementation
of the set data structure to the' base' library. Also, it brings a few
changes to the new functional map that was added in #664 , #654 .

# General changes:

* rename `PersistentOrderedMap` to `OrderedMap` (same for the
`OrderedSet`)
* improve docs

# Functional Map changes:

## New functionality:
+ add `any`/`all` functions
+ add `contains` function
+ add `minEntry`/`maxEntry`

## Optimizations:
+ Store `size` in the Map, [benchmark
results](serokell#35)

## Fixup: 
+ add `entriesRev()`, remove `iter()`

# NEW functional Set:

The new data structure implements an ordered set interface using
Red-Black trees as well as the new functional map from the 1-2
Milestones.

## API implemented:
* Basic operations (based on the map): `put`, `delete`, `contains`,
`fromIter`, etc
* Maps and folds: `map`, `mapFilter`, `foldLeft`, `foldRight`
* Set operations: `union` , `intersect`, `diff`, `isSubset`, `equal`
* Additional operations (as for the `OrderedMap`): `min`/`max`,
`all`/`some`

## Maintainance support:
* Unit, property tests
* Documentation

## Applied optimizations:

* Same optimizations that were useful for the functional map:
   * inline node color
   * float-out exceeded matching in iteration
   * `map`/`filterMap` through `foldLeft`
   * direct recursion in `foldLeft`
* [Benchmark results for all four optimizations
together](serokell#27)
* store size in the root of the tree, [benchmark
results](serokell#36 (comment))
* Pattern matching order optimization, [benchmark
results](serokell#36 (comment))
 * Other optimizations:
* Inline code of `OrderedMap` instead of sharing it, [benchmark
results](serokell#25)
* `intersect` optimization: use order of output values to build the
resulting tree faster, see
serokell#39
* `isSubset`, `equal` optimization: use early exit and use order of
subtrees to reduce intermediate tree height, see
serokell#37

## Rejected optimizations:

* Nipkow's implementation of set operation [Tobias Nipkow's "Functional
Data Structures and Algorithms", 117].
Initially, we were planning to use an implementation of set operations
(`intersect`, `union`, `diff`) from Nipkow's book. However, the
experiment shows that naive implementation with a simple size heuristic
performs better. [The benchmark
results](serokell#33) are comparing
3 versions:
* persistentset_baseline -- original implementation that uses Nipkow's
algorithms. However, the black height is calculated before each set
operation (the book assumes it's stored).
* persistentset_bh -- the same as the baseline but the black height is
stored in each node.
* persistentset -- naive implementation that looks up in a smaller set
and modifies a bigger one (it gives us `O(min(n,m)log((max(n,m))` which
is very close to Nipkow's version). Sizes of sets are also stored but
only in the root.
The last one outperforms others and keeps a tree slim in terms of byte
size. Thus, we have picked it.

## Final benchmark results:

### Collection benchmarks

| |binary_size|generate|max mem|batch_get 50|batch_put 50|batch_remove
50|upgrade|
|--:|--:|--:|--:|--:|--:|--:|--:|
|orderedset+100|218_168|186_441|37_916|53_044|121_237|127_460|346_108|
|trieset+100|211_245|574_022|47_652|131_218|288_429|268_499|729_696|

|orderedset+1000|218_168|2_561_296|520_364|69_883|158_349|170_418|3_186_579|

|trieset+1000|211_245|7_374_045|633_440|162_806|383_594|375_264|9_178_466|

|orderedset+10000|218_168|40_015_301|320_532|84_660|192_931|215_592|31_522_120|

|trieset+10000|211_245|105_695_670|682_792|192_931|457_923|462_594|129_453_045|

|orderedset+100000|218_168|476_278_087|3_200_532|98_553|230_123|258_372|409_032_232|

|trieset+100000|211_245|1_234_038_235|6_826_516|222_247|560_440|549_813|1_525_692_388|

|orderedset+1000000|218_168|5_514_198_432|32_000_532|115_836|268_236|306_896|4_090_302_778|

|trieset+1000000|211_245|13_990_048_548|68_228_312|252_211|650_405|642_099|17_455_845_492|

### set API

| |size|intersect|union|diff|equals|isSubset|
|--:|--:|--:|--:|--:|--:|--:|
|orderedset+100|100|146_264|157_544|215_871|28_117|27_726|
|trieset+100|100|352_496|411_306|350_935|201_896|201_456|
|orderedset+1000|1000|162_428|194_198|286_747|242_329|241_938|
|trieset+1000|1000|731_650|1_079_906|912_629|2_589_090|4_023_673|
|orderedset+10000|10000|177_080|231_070|345_529|2_383_587|2_383_591|

|trieset+10000|10000|3_986_854|21_412_306|5_984_106|46_174_710|31_885_381|
|orderedset+100000|100000|190_727|267_008|402_081|91_300_348|91_300_393|

|trieset+100000|100000|178_863_894|209_889_623|199_028_396|521_399_350|521_399_346|

|orderedset+1000000|1000000|205_022|304_937|464_859|912_901_595|912_901_558|

|trieset+1000000|1000000|1_782_977_198|2_092_850_787|1_984_818_266|5_813_335_155|5_813_335_151|

### new set API

| |size|foldLeft|foldRight|mapfilter|map|
|--:|--:|--:|--:|--:|--:|
|orderedset|100|16_487|16_463|88_028|224_597|
|orderedset|1000|133_685|131_953|1_526_510|4_035_782|
|orderedset|10000|1_305_120|1_287_495|28_455_361|51_527_733|
|orderedset|100000|13_041_665|12_849_418|344_132_505|630_692_463|
|orderedset|1000000|130_428_573|803_454_777|4_019_592_041|7_453_944_902|

---------

Co-authored-by: Andrei Borzenkov <andreyborzenkov2002@gmail.com>
Co-authored-by: Andrei Borzenkov <root@sandwitch.dev>
Co-authored-by: Sergey Gulin <sergeygulin96@gmail.com>
Co-authored-by: Claudio Russo <claudio@dfinity.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants