-
Notifications
You must be signed in to change notification settings - Fork 13.8k
Dial down detail of B-tree description #135761
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Could not assign reviewer from: |
/// A B-tree resembles a [binary search tree], but each leaf (node) contains | ||
/// an entire array (of unspecified size) of elements, instead of just a single element. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this talk about leaves is taken from #134088 but it doesn't reflect how BTreeMap
works. All nodes (leaf and interior nodes) contain an array of elements and search may terminate in any them, not just in leaves. That's also closer to how binary search trees usually work. B+ trees, a variant of the traditional B-tree, do the "only leaves store elements, interior nodes guide search to the right leaf" thing but Rust's BTree*
types are not like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so IIUC each node either has the element in its array, or it is in the left child, or it is in the right child.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now corrected to "node" instead of "leaf (node)".
/// A search first traverses the tree structure to find, in logarithmic time, the correct leaf. | ||
/// This leaf is then searched linearly, which is very fast on modern hardware. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may just be collateral damage from the leaves vs interior distinction, but it seems strange to only mention the linear search for leaves. The current implementation does it for every node, if it's mentioned for one it should be mentioned for both. Alternatively, it may be considered an implementation detail that distracts from the main point (logarithmic time complexity w.r.t. total number of elements, despite linear search in each node). But then it shouldn't be mentioned for leaves either. Certainly the way each node is searched is less central to the mechanical sympathy of B-trees than the "B" constant: fewer and larger nodes are obviously good (up to a point), while linear vs binary search is a more nuanced decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is definitely collateral damage from my faulty guessing at the internal workings. I guess I was too hasty dismissing this possibility as not being compatible with logarithmic behavior, but I see now that I was wrong. Thanks for clarifying.
I don't agree about larger nodes being obviously good. I think it has more to do with the minimum number of bits that a real machine can look at, due to the size of its cache lines or some other detail. If your algorithm uses smaller chunks, then it is wasting bandwidth. So I would say smaller is obviously better (since it gets you from linear search to logarithmic search), but only up to a point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed this detail and included something about the natural granularity of data for modern machines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the context of comparison against BST, anything more than one key per node is a "larger node". You're right that the granularity of memory/storage access (cache lines in main memory, block/page size for disks) is a useful rule of thumb. Larger nodes also make insertion and deletion more expensive. But you're mistaken about search complexity: as the existing comments note, you can maintain optimal number of comparisons (up to a small constant factor) by doing binary search in each node. Sometimes that turns out a bit slower than a linear search, but it's competitive or superior in other cases.
In any case, I don't think it's useful to try and communicate these nuances in the BTreeMap
docs.
/// A B-Tree instead makes each node contain B-1 to 2B-1 elements in a contiguous array. By doing | ||
/// this, we reduce the number of allocations by a factor of B, and improve cache efficiency in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do want to salvage e.g. a brief mention of fewer nodes with more elements being better for reducing heap allocations and cache misses, consider this suggestions from #134088 (comment)
If we do wish to explain the theory behind B-tree performance, we can do it near the end, when we are not crushed for time and space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've restored the old comments with the suggested fixes from @danielrainer from #134088 in a new background section below everything.
/// and possibly other factors. Using linear search, searching for a random element is expected | ||
/// to take B * log(n) comparisons, which is generally worse than a BST. In practice, | ||
/// however, performance is excellent. | ||
/// A B-tree resembles a [binary search tree], but each leaf (node) contains |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @workingjubilee (#134088 (comment)) that
our first primary goal is not to explain "why not BinarySearchTreeMap?" We don't have a BinarySearchTreeMap, but we do have HashMap, so our first task is "why should you EVER look at this data structure and not HashMap?" This is true even if we never directly compare, in the type's docs, against HashMap.
So while you're already condensing the existing description, do you think you could also include this aspect? Essentially, say more about this being an "ordered map" before diving into how it maintains order internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've included a short introduction that focuses on the ordered aspect and lifted the existing text about the order produced by iterators up. Also the section about not mucking about with the order is now much closer to the top and less unexpected (hopefully).
For any of the details that are still correct but just shouldn't be user-facing, leaving them as a non-doc comment seems preferable to removing them. Cc @workingjubilee if you want to yoink this review since you were involved in #134088 |
@tgross35 I've restored the old comments (with small corrections from the original bug report) to a new section at the end. |
/// triggers a heap-allocation, and every single comparison should be a cache-miss. Since these | ||
/// are both notably expensive things to do in practice, we are forced to, at the very least, | ||
/// reconsider the BST strategy. | ||
/// An ordered map is a map in which the keys are totally ordered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've occasionally seen people be confused about ordered maps in this sense vs. "map that maintains some consistent order" (e.g., insertion order) as exemplified by Python's OrderedDict
or http://crates.io/crates/ordermap in Rust. I'd throw in something about the entries being stored in sorted order before detailing what determines the sort order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a valuable insight and I can see how the first sentence can be mistaken for talking about a map with insertion order. It now says: "Given a key type with a [total order], an ordered map stores its entries in key order." which I feel is much harder to misunderstand. Let me know if you have more valuable suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not have any further suggestions but I gave up my bors permissions in 2020 (wow, that long ago?) when I stopped contributing regularly, so I can't move this any further. It's up to @tgross35 or another reviewer if they're too busy.
/// | ||
/// # Background | ||
/// | ||
/// A B-tree is (like) a [binary search tree], but adapted to the natural granularity that modern machines like to consume data at. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this line a little too long? But I don't see tidy complaining so idk, just looks out of place 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think this should be wrapped.
@tgross35 or @workingjubilee, would be great if you could take a look. |
I didn't realize this was still assigned to me, I'm not entirely sure what the vision from the original issue was |
Could not assign reviewer from: |
Ah sorry, I'll take that back then and try to have a look |
@tgross35 friendly ping |
Argh sorry, I thought this got rerolled a long time ago (probably should have). I still have no strong thoughts here so I'm just going to do that now r? libs |
/// triggers a heap-allocation, and for every comparison a node needs to be loaded, | ||
/// which could result in a cache miss. Since both heap-allocations and cache-misses |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// triggers a heap-allocation, and for every comparison a node needs to be loaded, | |
/// which could result in a cache miss. Since both heap-allocations and cache-misses | |
/// triggers a heap-allocation, and comparison is a potential cache-miss due to the indirection. | |
/// Since both heap-allocations and cache-misses |
r=me after the nits. |
a9447d1
to
f0e44d6
Compare
This comment has been minimized.
This comment has been minimized.
f0e44d6
to
1b77387
Compare
This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers. |
@rustbot ready |
@bors r+ rollup |
Dial down detail of B-tree description fixes rust-lang#134088, though it is a shame to lose some of this wonderful detail. r? `@workingjubilee` EDIT: newest versions keep old detail, but move it down a bit.
Dial down detail of B-tree description fixes rust-lang#134088, though it is a shame to lose some of this wonderful detail. r? ``@workingjubilee`` EDIT: newest versions keep old detail, but move it down a bit.
Rollup of 6 pull requests Successful merges: - #135761 (Dial down detail of B-tree description) - #144373 (remove deprecated Error::description in impls) - #145620 (Account for impossible bounds making seemingly unsatisfyable dyn-to-dyn casts) - #145783 (add span to struct pattern rest (..)) - #145817 (cg_llvm: Replace the `llvm::Bool` typedef with a proper newtype) - #145820 (raw-dylib-elf: set correct `DT_VERDEFNUM`) r? `@ghost` `@rustbot` modify labels: rollup
Dial down detail of B-tree description fixes rust-lang#134088, though it is a shame to lose some of this wonderful detail. r? ```@workingjubilee``` EDIT: newest versions keep old detail, but move it down a bit.
Rollup of 5 pull requests Successful merges: - #135761 (Dial down detail of B-tree description) - #144373 (remove deprecated Error::description in impls) - #145620 (Account for impossible bounds making seemingly unsatisfyable dyn-to-dyn casts) - #145817 (cg_llvm: Replace the `llvm::Bool` typedef with a proper newtype) - #145820 (raw-dylib-elf: set correct `DT_VERDEFNUM`) r? `@ghost` `@rustbot` modify labels: rollup
Dial down detail of B-tree description fixes rust-lang#134088, though it is a shame to lose some of this wonderful detail. r? ````@workingjubilee```` EDIT: newest versions keep old detail, but move it down a bit.
This comment has been minimized.
This comment has been minimized.
Dial down detail of B-tree description try-job: dist-various-2
Rollup of 10 pull requests Successful merges: - #135761 (Dial down detail of B-tree description) - #145620 (Account for impossible bounds making seemingly unsatisfyable dyn-to-dyn casts) - #145788 (Fix attribute target checking for macro calls) - #145794 (bootstrap.py: Improve CPU detection on NetBSD) - #145817 (cg_llvm: Replace the `llvm::Bool` typedef with a proper newtype) - #145820 (raw-dylib-elf: set correct `DT_VERDEFNUM`) - #145828 (Update `bitflags` to 2.9.3.) - #145830 (Remove the lifetime from `ExpTokenPair`/`SeqSep`.) - #145836 (Remove outdated bug comments) - #145842 (rustc-dev-guide subtree update) r? `@ghost` `@rustbot` modify labels: rollup
Dial down detail of B-tree description fixes rust-lang#134088, though it is a shame to lose some of this wonderful detail. r? `@workingjubilee` EDIT: newest versions keep old detail, but move it down a bit.
Rollup of 10 pull requests Successful merges: - rust-lang/rust#135761 (Dial down detail of B-tree description) - rust-lang/rust#145620 (Account for impossible bounds making seemingly unsatisfyable dyn-to-dyn casts) - rust-lang/rust#145788 (Fix attribute target checking for macro calls) - rust-lang/rust#145794 (bootstrap.py: Improve CPU detection on NetBSD) - rust-lang/rust#145817 (cg_llvm: Replace the `llvm::Bool` typedef with a proper newtype) - rust-lang/rust#145820 (raw-dylib-elf: set correct `DT_VERDEFNUM`) - rust-lang/rust#145828 (Update `bitflags` to 2.9.3.) - rust-lang/rust#145830 (Remove the lifetime from `ExpTokenPair`/`SeqSep`.) - rust-lang/rust#145836 (Remove outdated bug comments) - rust-lang/rust#145842 (rustc-dev-guide subtree update) r? `@ghost` `@rustbot` modify labels: rollup
fixes #134088, though it is a shame to lose some of this wonderful detail.
r? @workingjubilee
EDIT: newest versions keep old detail, but move it down a bit.