Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable ordinary-index building without a page count bound #419

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jeltsch
Copy link
Collaborator

@jeltsch jeltsch commented Oct 4, 2024

So far, incremental construction of an ordinary index required users to provide an upper bound of the number of pages or equivalently the number of keys stored in the index. However, such an upper bound cannot be reliably computed.

This pull request introduces an implementation of “growing vectors” and adapts ordinary-index construction to make use of it. As a result, it is not necessary anymore to provide an upper bound of the number of keys but only an initial size of the key buffer. For performance reasons, it is good for this initial size to be an upper bound of the number of keys, but index construction will also work if it is not.

This pull request does not contain tests for growing vectors. These will be added later by resolving #420.

@jeltsch jeltsch added the enhancement New feature or request label Oct 4, 2024
@jeltsch jeltsch self-assigned this Oct 4, 2024
Copy link
Collaborator

@jorisdral jorisdral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just a few small suggestions

Comment on lines +108 to +117
{-|
Turns a growing vector into an ordinary vector, thereby invalidating the
growing vector. Executing @unsafeFreeze vec@ is only safe when @vec@ is not
used afterwards.
-}
unsafeFreeze :: GrowingVector s a -> ST s (Vector a)
unsafeFreeze (GrowingVector bufferRef lengthRef) = do
buffer <- readSTRef bufferRef
length <- readPrimVar lengthRef
force <$> Mutable.unsafeFreeze (Mutable.take length buffer)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly, force will always copy the contents into a new vector. As such, this implementation is actually a safe freeze

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had wondered about this already before sending you this pull request. I have checked now: From a quick look at the code, I get indeed the impression that force will always copy the contents. However, the API documentation says this:

Yield the argument, but force it not to retain any extra memory, possibly by copying it.

So the specification leaves open the possibility of force returning the original vector if this vector uses its underlying array in full. We might hope that future versions of the vector package won’t make use of this opportunity, but I’d like to err on the safe side and continue considering the above function unsafe.

new initialBufferSize
= do
buffer <- Mutable.new initialBufferSize
bufferRef <- newSTRef buffer
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bufferRef <- newSTRef buffer
bufferRef <- newSTRef $! buffer

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don’t think this is necessary. Since we’re using the strict version of ST, the computation Mutable.new initialBufferSize will definitely be performed before newSTRef buffer, and I’d be very surprised if new wouldn’t always yield a fully evaluated result. Thoughts?


makeRoom :: ST s ()
makeRoom = do
length <- readPrimVar lengthRef
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This length was just read in the do block above

Copy link
Collaborator Author

@jeltsch jeltsch Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You’re absolutely right, and I was fully aware of that when I filed this pull request. 🙂 I’m not quite sure how to handle this situation. I would like to have exactly the code for “making room” be separated, so that it doesn’t clutter the main code. However, self-contained code for making room needs to contain the initial reading of the length, which fulfills a different purpose than the reading of the length in the main code, which is for saving the original length in order to use it later in a computation. An alternative would be to pass the length as an argument to makeRoom, but this seems a bit awkward to me. Would you be fine with keeping the code as it is now? Do you perhaps have a different idea of how to deal with this problem?

let

higherBufferSizes :: [Int]
higherBufferSizes = tail (init ++ [last]) where
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maxBound should be part of the higher buffer sizes as well. last could be much less than maxBound

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had considered this as well but then dropped this idea. The thing is that, while maxBound could be as “low” as 2 ^ 27 - 1 according to the Haskell Report, in practice it is likely to be 2 ^ 63 - 1 and thus so high that the number of pages will probably never get anywhere close to it. With the current implementation, the last element of higherBufferSizes is at least 2 ^ 62 and thus still so high that it will probably never be needed. I think that overflow of buffer sizes should be prevented even if it is unlikely to ever occur, simply to be safe and for the code to be correct. However, I don’t think we have to introduce the extra, albeit small, complexity that comes from adding maxBound to higherBufferSizes, given that doing so probably wouldn’t make any difference in practice. What are your thoughts on this?

Comment on lines +104 to +105
-> Mutable.grow buffer (bufferSize' - bufferSize) >>=
writeSTRef bufferRef
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
-> Mutable.grow buffer (bufferSize' - bufferSize) >>=
writeSTRef bufferRef
-> Mutable.grow buffer (bufferSize' - bufferSize) >>=
(writeSTRef bufferRef $!)

Copy link
Collaborator Author

@jeltsch jeltsch Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that ($!) is not needed here, for the same reasons as in the invocation of newSTRef above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants