Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BTrees in Motoko. #396

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from
Draft

BTrees in Motoko. #396

wants to merge 10 commits into from

Conversation

matthewhammer
Copy link
Contributor

@matthewhammer matthewhammer commented Jun 23, 2022

@matthewhammer
Copy link
Contributor Author

matthewhammer commented Jun 23, 2022

Status:

  • provisional data type definitions for the B-Tree
  • Check module checks its invariants

Still to do:

  • find
  • insert
  • insertSorted
  • remove

And then usual utilities like iteration, mapping, etc.

Copy link
Contributor

@ByronBecker ByronBecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments and suggestions - thanks for starting into this 👍

/// Constants we use to shape the tree.
/// See https://en.wikipedia.org/wiki/B-tree#Definition
module Constants {
let MAX_CHILDREN = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the max 4? https://panthema.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html shows that 32-128 perform considerably better at large n.

I think it therefore might make sense to allow the developer to configure larger child values (i.e. 4, 16, 32, 64, 128, 256). I'd be curious to run some performance tests inserting a running counter or batch of elements (ordered or unordered) to see what difference this might make as the tree grows in size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to write simple tests before adjusting this number into something that varies, which I agree is desirable.

src/BTree.mo Outdated

public type Tree<K, V> = {
#index : Index<K, V>;
#data : Data<K, V>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the Tree variant's values be renamed #node and #leaf? Index makes it sound more like a hash table type of lookup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, except that I'd prefer to call it "internal" and "leaf", since both are kinds of "nodes" in my mind.

"Index" seemed more natural when I was misunderstanding the structure of a B-Tree, and thinking of it as a B+-Tree (I guess?) where the data is only at the leaves; then the internal nodes only store keys, and serve as an index.

Now, I've adopted the standard B-Tree definition where key-value pairs can be internal too, and the name "index" is more inappropriate.

BTW, as one reference for "standard", I've been looking at the Rust implementation in std, and the helpful comments in particular. Hopefully this implementation will be much (much!) shorter, at least initially.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'm fine with both of those as long as there's a comment above the type definition for the reader.

Copy link
Contributor Author

@matthewhammer matthewhammer Jul 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with "internal" and "leaf".

What should the comment say?

src/BTree.mo Outdated
#data : Data<K, V>;
};

func find_data<K, V>(data : Data<K, V>, find_k : K, c : Compare<K>) : ?V {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c can just be of type (K, K) -> Order.Order.

So the function signature can become

func find_data<K, V>(data : Data<K, V>, find_k : K, compareTo: (K, K) -> Order.Order) : ?V {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the advantage of that? Less cognitive overhead in some way?

I actually like the record with a function inside pattern, even with just one function: If I want to add debugging later, or any other operations, it helps to have a record in place already to place those operations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't change the API because of concern about adding debugging.

If you want backwards compatibility and a more flexible API, why not just make the function parameters a record?

func find_data<K, V>(options: {
  data : Data<K, V>;
  find_k : K;
  compareTo: (K, K) -> Order.Order;
  // add new optional parameters here later if desired
}) : ?V {
 ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I've removed that record-of-function from the public-facing API used in tests, etc.

Copy link
Contributor Author

@matthewhammer matthewhammer Jul 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, it was indeed useful to keep in in the checking API (just for tests), since then I could debug by printing out keys from within the BTree code.

Without having a show function for the key type, there's no way for the generic code to print them during debugging phases of development.

compare : (K, K) -> Order.Order
};

public type Data<K, V> = [(K, V)];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, what's the benefit of the Data<K, V> type being the tuple [(K, V)] over something like

public type Data<K, V> = {
  key: K;
  value: V;
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a slight space savings to the tuple (1 word per instance). Otherwise, I'd prefer the record with labeled fields.

I actually had that same record definition initially, but recalled this recent review from Claudio for the HashMap improvements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why is Data an array and not just (K, V)? I thought the data of each leaf in the index was already folded into the index type here

  public type Index<K, V> = {
    data : Data<K, V>; // data value pointing to this index
    trees : [Tree<K, V>]; // each of these sub-trees in the array has their own (K, V)
  };

I'm probably missing something as it's early and I definitely didn't get enough sleep last night 😅

Is there a reference implementation you used for this that I can peek through?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/BTree.mo Outdated
Comment on lines 46 to 50
for (j in I.range(0, i.data.size())) {
switch (c.compare(k, i.data[j].0)) {
case (#equal) { return ?i.data[j].1 };
case (#less) { return find<K, V>(i.trees[j], k, c) };
case _ { }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since trees is an ordered array, this could be a great place to do a modified binary search on the i.trees array to find the correct subtree to navigate down.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, are there any (scalability) concerns about the call stack size since we're using recursion vs. a loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be a great place to do a modified binary search on the i.trees array to find the correct subtree to navigate down.

Yes, true!

OTOH, that would not be a "baby step". : )

Ideally, we'd profile both variations, at various choices of m (the branching factor for B-Trees seems to be called m by everyone for some reason) and we'd have some way to know if there's a benefit, and how much. There may be little benefit when m is small, or it may even be worse (?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, are there any (scalability) concerns about the call stack size since we're using recursion vs. a loop?

Within the compiler, tail recursion to the same function gets transformed into a loop, with no stack/recursion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any difference in performance in terms of what it gets transformed down to? I'm planning on writing an imperative RBTree later this week to test out the RBTree performance issues I was running into in #390

Copy link
Contributor

@ByronBecker ByronBecker Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might end up being the case that a BTree with index/sub-tree size 16/32/64 removes this issue since the tree is more shallow and there's less rebalancing steps required 🤷‍♂️

Just for context, the reason the insertion performance is important to me is that I'd prefer to find a balanced data structure solution that avoids hitting the message instruction limit and doesn't rely on deterministic time slicing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I think B-Trees will be shallow for any realistic internal node size (16, 32, etc.).

As mentioned above, in Motoko, any self-contained tail-recursive function (not two mutually-recursive ones) will become a single while loop without a stack during compilation.

For any realistic BTree, that stack-free loop should run at most a handful of iterations (less than 10), so the cycle limit will certainly not matter for searching, even if you need to search for a bunch of records in random positions all within one message response time.

src/BTree.mo Outdated
/// Check that a B-Tree instance observes invariants of B-Trees.
/// Invariants ensure performance is what we expect.
/// For testing and debugging.
public module Check {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know we could have more than one module in a file! I learn something new every day :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! And they can even vary in whether they are public or private, and nest to any depth.

All imports have to be at the top of the file though (very unlike Rust).

src/BTree.mo Outdated
compare : (?K, ?K) -> Order.Order
};

func compareOp<K>(c : Compare<K>) : CompareOp<K> = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment on the Compare type to above

src/BTree.mo Outdated Show resolved Hide resolved
src/BTree.mo Outdated
};

func index<K, V>(lower : ?K, c : CompareOp<K>, i : Index<K, V>, upper : ?K) {
assert (i.data.size() + 1 == i.trees.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this keep the invariant? I thought i.data was just [(K, V)], whereas i.trees could be any size array specified by the BTree child size

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I don't understand the question entirely, but the arrays of (sub)trees and keys (in data) are highly related under my current understanding of B-Trees, and the way they work in the rust implementation.

From that work, I get that this property holds (related to assert (i.data.size() + 1 == i.trees.size());)

For each pair of consecutive subtrees (t1, t2), there is a key k12 (and value v12) such that keysOf(t1) < k12 < keysOf(t2)

(abusing notation to compare sets of keys with a single key; what I mean is every key in each set observes the relevant comparison operator holding.)

I found it very helpful to look at pictures that people drew, and draw some myself. For instance, the one on the wikipedia page shows this pattern where keys are shown between consecutive subtrees..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS: I strongly suspect that this invariant code is wrong. It's a baby step that will need to be revised, probably a few times. Thanks for taking a look!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on briefly looking at that wikipedia page image, let me know if I'm understanding this correctly...

So for each new level/sub-tree the "trees" at that level get instantiated to be a specific size array with each data slot containing a default "empty" value until filled.

I understand this invariant a bit better now given my response to this comment of yours.

It has to do that you have both a data array and a sub-trees array - which I argue could be folded into one array, and just have a single data tuple (and not a data tuple array).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@matthewhammer matthewhammer Jul 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes the B-Tree "tricky" as a simple type def, IMO is the combination of array structure and tree structure, and how it mixes them together in an efficient implementation that uses arrays to organize things of the same type.

In a "usual" internal node, there is a key-value between each pair of consecutive sub-trees, giving a key that separates these sub-trees' keys.

But there is no key-value pair on either end of the sub-tree list, so in total, there is one fewer key-value pairs than sub-trees.

For instance, see this line of the checking algorithm.

The Rust implementation linked above does this, based on me reading the comments (not the code itself).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I understand completely now! Thanks for explaining.

@matthewhammer
Copy link
Contributor Author

Hey @ByronBecker thanks for the lively discussion about the initial code!

I've taken our discussion into account in the latest commit, which also adds unit tests for some tiny cases.

Please let me know what you think about the revisions, when you have time.

Copy link
Contributor

@ByronBecker ByronBecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments/suggestions. Looking forward to insert 🎉

Comment on lines 42 to 47
let _ = Suite.suite("find", [
Suite.test("pine",
BT.find<Text, Nat>(binary_internal, "pine", Text.compare),
M.equals(T.optional<Nat>(T.natTestable, ?42))
)
]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 To using matchers instead of debug statements for tests.

This is the pattern I've been using for my tests.

I create a "suite" for a specific function, like here where I have all of the different test cases for that function inside it.

Then I run the test suite for each function here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 To using matchers instead of debug statements for tests.

For me, it's not one or the other. It's "which here?" -- Like all tools and techniques, there is a place for matches, but it's not a universal solution, or even close to one.

For instance, what value would it add to the first tests, where I just want to define a bunch of trees using let expressions that scope in a nested way, and assert that they are valid? To me, plain Motoko is the right structure for that test, and I used Debug prints so I know where I am in that script.

How does matchers help there? I can't see it -- It's not really helping guide that sort of test structure at all, and that's fine. I think it's okay to use it only where it helps --- organizing tests of an API, ensuring that each operation has some coverage. (I plan to use it for the rest of the API coverage tests.)

};
};

/// Check that a B-Tree instance observes invariants of B-Trees.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this module is just for testing/debugging, should Check be in a different file?

When I import the default BTree module like import BTree "mo:base/BTree";, does it shake out the Check, or does it include it and bloat the import size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it shake out the Check, or does it include it and bloat the import size?

This module is static. The compiler will not compile static code that you import but do not use. (However, class instances will always contain all methods, regardless of usage. Not applicable here.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this module is just for testing/debugging, should Check be in a different file?

Why? It's more enclosed and encapsulated here. There is no way to "hide" a top-level module in base.

Comment on lines +99 to +100
if (i.data.size() == 0) { assert i.trees.size() == 0; return };
if (i.trees.size() < 2) { assert false };
Copy link
Contributor

@ByronBecker ByronBecker Jul 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a benefit to having all of the invalid cases trap vs. have be this be more like a isBTreeValid() function?

I guess the question from a usability standpoint is do you see developers using it like

try {
  Check.root<Text, Text>(Text.compare, bt);
  bt;
} catch e {
  // invalid btree logic here
}

or

do {
  if (isBTreeValid<Text, Text>(Text.compare, bt) {
    bt
  } else {
    // invalid btree logic here
  }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using assert is easier for this phase of development.

Indeed, using assert does not lead to composable code (the second version of the code you gave wouldn't work), so eventually, to make these checks more composable and thus more useful, they should either return ? Error or perhaps Result<_, Error> for some well-defined variant type of Error (to be defined) -- not sure about the value of returning anything in the #ok case, so that's why I'd tend toward returning ? Error and have null mean "okay", no errors. Then, a wrapper can check for null and return true, or some other kind of Result, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using assert is easier for this phase of development.

My intentions are time-variant:

  1. Now, before all functions are defined, I'd like to focus on writing them, and using these assertions to check that they are correct (after each unit test, I'd use the assertion code to ensure that the unit tests' operations preserve all invariants of the BTrees.). They need not be composable to be used in unit tests.

  2. Once the basic API works, we can extend it with some invariant checks, if that seems useful. Or, they could remain as is, just used in unit tests, depending. If we do expose them, I envision them returning a value, not trapping.

Comment on lines +72 to +82
public func assertIsValid<K, V>(
t : Tree<K, V>,
compare : (K, K) -> Order.Order,
show : K -> Text)
{
Check.root<K, V>({compare; show}, t)
};

public func assertIsValidTextKeys<V>(t : Tree<Text, V>){
Check.root<Text, V>({compare=Text.compare; show=func (t:Text) : Text { t }}, t)
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it help if Check and these test methods were just moved into test/BTreeTest.mo?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants