-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serokell: [Milestone-1] New PersistentOrderedMap.mo lib #654
base: master
Are you sure you want to change the base?
Conversation
Dear @GoPavel, In order to potentially merge your code in this open-source repository and therefore proceed with your contribution, we need to have your approval on DFINITY's CLA. If you decide to agree with it, please visit this issue and read the instructions there. Once you have signed it, re-trigger the workflow on this PR to see if your code can be merged. — The DFINITY Foundation |
740c391
to
41b6516
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some interim comments (more to come, just submitting because I expect GH might lost my review soon)
/// with `Map` to maintain invariant that comparator did not changed. | ||
/// | ||
/// `MapOps` contains methods that require `compare` internally: | ||
/// operations that may reshape a `Map` or should find something. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// operations that may reshape a `Map` or should find something. | |
/// operations that update, traverse or search a map. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think traverse
doesn't require comp
(since we keep tree structure) so it's outside of the MapOps
.
So I think it should be
/// operations that may reshape a `Map` or should find something. | |
/// operations that update or search a map. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that's a UX question - should we put all operations in MapOps, even if they don't require the comparison, just for uniformity? I guess we could also do that later easily enough. Something to think about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we put all operations in MapOps
I also thought about this. We were going to measure its overhead on a benchmark and then decide. If there is no difference I would also put it there.
However for now some modules like RBTree
or TrieMap
provide some functionality as a module-function, not as a member-function e.g. map
or iter
. So we thought that the current solution is not very inconsistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did performance experiments:
- Comparing inlined internal functions and the current version (all calls through
MapOps
): [DMS-68] Inline internal functions into MapOps object serokell/motoko-base#21 - Comparing calling methods through MapOps and through the
Internal
module passingcomp
manually: [DMS-68] Compare function calls from MapOps with direct function call serokell/motoko-base#20
The 1st experiment shows that the inlined version has more or less the same performance (<1% difference).
The 2nd one shows that a static function call is faster than a similar one from MapOps
on 1-4% percent.
Thus we can say that static calls are less convenient but faster, so probably it makes sense to keep the option ?
We can add the rest of the functions into MapOps
and make the Internal
module public.
OR keep the current version as compromised
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess we decide to put all operations in MapOps and leave Internal private for now at a 1-4% perf hit.
Does the perf hit disappear entirely if we inline the Internal functions in MapOps, or am I misunderstanding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some more comments...
No changes in logic so far, just simple refactoring
Add `MapOps` class with the following signature: public class MapOps<K>(compare : (K,K) -> O.Order) { public func put<V>(rbMap : Map<K, V>, key : K, value : V) : Map<K, V> public func fromIter<V>(i : I.Iter<(K,V)>) : Map<K, V> public func replace<V>(rbMap : Map<K, V>, key : K, value : V) : (Map<K,V>, ?V) public func mapFilter<V1, V2>(f : (K, V1) -> ?V2, rbMap : Map<K, V1>) : Map<K, V2> public func get<V>(key : K, rbMap : Map<K, V>) : ?V public func delete<V>(rbMap : Map<K, V>, key : K) : Map<K, V> public func remove< V>(rbMap : Map<K, V>, key : K) : (Map<K,V>, ?V) }; The other functionality provided as standalone functions, as they don't require comparator: public type Direction = { #fwd; #bwd }; public func iter<K, V>(rbMap : Map<K, V>, direction : Direction) : I.Iter<(K, V)> public func entries<K, V>(m : Map<K, V>) : I.Iter<(K, V)> public func keys<K, V>(m : Map<K, V>, direction : Direction) : I.Iter<K> public func vals<K, V>(m : Map<K, V>, direction : Direction) : I.Iter<V> public func map<K, V1, V2>(f : (K, V1) -> V2, rbMap : Map<K, V1>) : Map<K, V2> public func size<K, V>(t : Map<K, V>) : Nat public func foldLeft<Key, Value, Accum>( combine : (Key, Value, Accum) -> Accum, base : Accum, rbMap : Map<Key, Value> ) : Accum And foldRight with the same signature as foldLeft The following functions are new for the API: - MapOps.put, MapOps.delete - MapOps.fromIter, entries, keys, vals - MapOps.mapFilter, map - foldLeft, foldRight
Problem: now order is not consistent within new module and with old modules as well. Solution: make the map argument always go first
In addition to tests this patch removes `direction` argument from `keys` and `values` function to keep them simple and provides a new function `Map.empty` to create a map without knowing its internal representation.
* rename `rbMap` into `m` in signature for brevity & consistent language * rename `rbMap` into `map` in examples for brevity & encapsulation sake * rename `tree` into `map` in doc comments for the encapsulation sake
@crusso |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments
/// with `Map` to maintain invariant that comparator did not changed. | ||
/// | ||
/// `MapOps` contains methods that require `compare` internally: | ||
/// operations that may reshape a `Map` or should find something. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess we decide to put all operations in MapOps and leave Internal private for now at a 1-4% perf hit.
Does the perf hit disappear entirely if we inline the Internal functions in MapOps, or am I misunderstanding?
We have tried it, and it doesn't affect the benchmark at all. See serokell#21 |
This is an MR for the 3rd Milestone of the Serokell's grant about improving Motoko's base library. The main goal of the PR is to introduce a new functional implementation of the set data structure to the' base' library. Also, it brings a few changes to the new functional map that was added in #664 , #654 . # General changes: * rename `PersistentOrderedMap` to `OrderedMap` (same for the `OrderedSet`) * improve docs # Functional Map changes: ## New functionality: + add `any`/`all` functions + add `contains` function + add `minEntry`/`maxEntry` ## Optimizations: + Store `size` in the Map, [benchmark results](serokell#35) ## Fixup: + add `entriesRev()`, remove `iter()` # NEW functional Set: The new data structure implements an ordered set interface using Red-Black trees as well as the new functional map from the 1-2 Milestones. ## API implemented: * Basic operations (based on the map): `put`, `delete`, `contains`, `fromIter`, etc * Maps and folds: `map`, `mapFilter`, `foldLeft`, `foldRight` * Set operations: `union` , `intersect`, `diff`, `isSubset`, `equal` * Additional operations (as for the `OrderedMap`): `min`/`max`, `all`/`some` ## Maintainance support: * Unit, property tests * Documentation ## Applied optimizations: * Same optimizations that were useful for the functional map: * inline node color * float-out exceeded matching in iteration * `map`/`filterMap` through `foldLeft` * direct recursion in `foldLeft` * [Benchmark results for all four optimizations together](serokell#27) * store size in the root of the tree, [benchmark results](serokell#36 (comment)) * Pattern matching order optimization, [benchmark results](serokell#36 (comment)) * Other optimizations: * Inline code of `OrderedMap` instead of sharing it, [benchmark results](serokell#25) * `intersect` optimization: use order of output values to build the resulting tree faster, see serokell#39 * `isSubset`, `equal` optimization: use early exit and use order of subtrees to reduce intermediate tree height, see serokell#37 ## Rejected optimizations: * Nipkow's implementation of set operation [Tobias Nipkow's "Functional Data Structures and Algorithms", 117]. Initially, we were planning to use an implementation of set operations (`intersect`, `union`, `diff`) from Nipkow's book. However, the experiment shows that naive implementation with a simple size heuristic performs better. [The benchmark results](serokell#33) are comparing 3 versions: * persistentset_baseline -- original implementation that uses Nipkow's algorithms. However, the black height is calculated before each set operation (the book assumes it's stored). * persistentset_bh -- the same as the baseline but the black height is stored in each node. * persistentset -- naive implementation that looks up in a smaller set and modifies a bigger one (it gives us `O(min(n,m)log((max(n,m))` which is very close to Nipkow's version). Sizes of sets are also stored but only in the root. The last one outperforms others and keeps a tree slim in terms of byte size. Thus, we have picked it. ## Final benchmark results: ### Collection benchmarks | |binary_size|generate|max mem|batch_get 50|batch_put 50|batch_remove 50|upgrade| |--:|--:|--:|--:|--:|--:|--:|--:| |orderedset+100|218_168|186_441|37_916|53_044|121_237|127_460|346_108| |trieset+100|211_245|574_022|47_652|131_218|288_429|268_499|729_696| |orderedset+1000|218_168|2_561_296|520_364|69_883|158_349|170_418|3_186_579| |trieset+1000|211_245|7_374_045|633_440|162_806|383_594|375_264|9_178_466| |orderedset+10000|218_168|40_015_301|320_532|84_660|192_931|215_592|31_522_120| |trieset+10000|211_245|105_695_670|682_792|192_931|457_923|462_594|129_453_045| |orderedset+100000|218_168|476_278_087|3_200_532|98_553|230_123|258_372|409_032_232| |trieset+100000|211_245|1_234_038_235|6_826_516|222_247|560_440|549_813|1_525_692_388| |orderedset+1000000|218_168|5_514_198_432|32_000_532|115_836|268_236|306_896|4_090_302_778| |trieset+1000000|211_245|13_990_048_548|68_228_312|252_211|650_405|642_099|17_455_845_492| ### set API | |size|intersect|union|diff|equals|isSubset| |--:|--:|--:|--:|--:|--:|--:| |orderedset+100|100|146_264|157_544|215_871|28_117|27_726| |trieset+100|100|352_496|411_306|350_935|201_896|201_456| |orderedset+1000|1000|162_428|194_198|286_747|242_329|241_938| |trieset+1000|1000|731_650|1_079_906|912_629|2_589_090|4_023_673| |orderedset+10000|10000|177_080|231_070|345_529|2_383_587|2_383_591| |trieset+10000|10000|3_986_854|21_412_306|5_984_106|46_174_710|31_885_381| |orderedset+100000|100000|190_727|267_008|402_081|91_300_348|91_300_393| |trieset+100000|100000|178_863_894|209_889_623|199_028_396|521_399_350|521_399_346| |orderedset+1000000|1000000|205_022|304_937|464_859|912_901_595|912_901_558| |trieset+1000000|1000000|1_782_977_198|2_092_850_787|1_984_818_266|5_813_335_155|5_813_335_151| ### new set API | |size|foldLeft|foldRight|mapfilter|map| |--:|--:|--:|--:|--:|--:| |orderedset|100|16_487|16_463|88_028|224_597| |orderedset|1000|133_685|131_953|1_526_510|4_035_782| |orderedset|10000|1_305_120|1_287_495|28_455_361|51_527_733| |orderedset|100000|13_041_665|12_849_418|344_132_505|630_692_463| |orderedset|1000000|130_428_573|803_454_777|4_019_592_041|7_453_944_902| --------- Co-authored-by: Andrei Borzenkov <andreyborzenkov2002@gmail.com> Co-authored-by: Andrei Borzenkov <root@sandwitch.dev> Co-authored-by: Sergey Gulin <sergeygulin96@gmail.com> Co-authored-by: Claudio Russo <claudio@dfinity.org>
This is an MR for the 1st Milestone of Serokell's grant about improving Motoko's base library.
Within it, we are introducing a new functional implementation of the map data structure based on the current
RBTree.mo
.Content of the MR:
RBTree.mo
map
foldLeft
,foldRight
mapFilter
keys
,vals
?V -> V