New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Enable ordinary-index building without a page count bound #419

Open

jeltsch wants to merge 1 commit into main from jeltsch/growing-vector

+141 −45

Collaborator

jeltsch commented Oct 4, 2024 •

edited

Loading

So far, incremental construction of an ordinary index required users to provide an upper bound of the number of pages or equivalently the number of keys stored in the index. However, such an upper bound cannot be reliably computed.

This pull request introduces an implementation of “growing vectors” and adapts ordinary-index construction to make use of it. As a result, it is not necessary anymore to provide an upper bound of the number of keys but only an initial size of the key buffer. For performance reasons, it is good for this initial size to be an upper bound of the number of keys, but index construction will also work if it is not.

This pull request does not contain tests for growing vectors. These will be added later by resolving #420.

jeltsch added the enhancement label

jeltsch self-assigned this

jeltsch requested review from dcoutts, jorisdral, mheinzel and recursion-ninja as code owners

October 4, 2024 23:14

jeltsch force-pushed the jeltsch/growing-vector branch from 6bc89d6 to 1db7791 Compare

October 4, 2024 23:20

jeltsch mentioned this pull request

Add tests for growing vectors #420

Open


          Enable ordinary-index building without a page count bound

db4f2f9

jeltsch force-pushed the jeltsch/growing-vector branch from 1db7791 to db4f2f9 Compare

October 4, 2024 23:29

jorisdral requested changes

View reviewed changes

Collaborator

jorisdral left a comment

Looks great, just a few small suggestions

src/Database/LSMTree/Internal/Vector/Growing.hs

Comment on lines +108 to +117

+              {-|
+                  Turns a growing vector into an ordinary vector, thereby invalidating the
+                  growing vector. Executing @unsafeFreeze vec@ is only safe when @vec@ is not
+                  used afterwards.
+              -}
+              unsafeFreeze :: GrowingVector s a -> ST s (Vector a)
+              unsafeFreeze (GrowingVector bufferRef lengthRef) = do
+                  buffer <- readSTRef bufferRef
+                  length <- readPrimVar lengthRef
+                  force <$> Mutable.unsafeFreeze (Mutable.take length buffer)

Collaborator

jorisdral Oct 7, 2024

If I remember correctly, force will always copy the contents into a new vector. As such, this implementation is actually a safe freeze

Collaborator Author

jeltsch Oct 15, 2024

I had wondered about this already before sending you this pull request. I have checked now: From a quick look at the code, I get indeed the impression that force will always copy the contents. However, the API documentation says this:

Yield the argument, but force it not to retain any extra memory, possibly by copying it.

So the specification leaves open the possibility of force returning the original vector if this vector uses its underlying array in full. We might hope that future versions of the vector package won’t make use of this opportunity, but I’d like to err on the safe side and continue considering the above function unsafe.

src/Database/LSMTree/Internal/Vector/Growing.hs

+              new initialBufferSize
+                  = do
+                      buffer <- Mutable.new initialBufferSize
+                      bufferRef <- newSTRef buffer

Collaborator

jorisdral Oct 7, 2024

Suggested change

      
                    bufferRef <- newSTRef buffer
          
                    bufferRef <- newSTRef $! buffer

Collaborator Author

jeltsch Oct 15, 2024

Hmm, I don’t think this is necessary. Since we’re using the strict version of ST, the computation Mutable.new initialBufferSize will definitely be performed before newSTRef buffer, and I’d be very surprised if new wouldn’t always yield a fully evaluated result. Thoughts?

src/Database/LSMTree/Internal/Vector/Growing.hs

+                  makeRoom :: ST s ()
+                  makeRoom = do
+                      length <- readPrimVar lengthRef

Collaborator

jorisdral Oct 7, 2024

This length was just read in the do block above

Collaborator Author

jeltsch Oct 15, 2024 •

edited

Loading

You’re absolutely right, and I was fully aware of that when I filed this pull request. 🙂 I’m not quite sure how to handle this situation. I would like to have exactly the code for “making room” be separated, so that it doesn’t clutter the main code. However, self-contained code for making room needs to contain the initial reading of the length, which fulfills a different purpose than the reading of the length in the main code, which is for saving the original length in order to use it later in a computation. An alternative would be to pass the length as an argument to makeRoom, but this seems a bit awkward to me. Would you be fine with keeping the code as it is now? Do you perhaps have a different idea of how to deal with this problem?

src/Database/LSMTree/Internal/Vector/Growing.hs

+                          let
+                              higherBufferSizes :: [Int]
+                              higherBufferSizes = tail (init ++ [last]) where

Collaborator

jorisdral Oct 7, 2024

I think maxBound should be part of the higher buffer sizes as well. last could be much less than maxBound

Collaborator Author

jeltsch Oct 15, 2024

I had considered this as well but then dropped this idea. The thing is that, while maxBound could be as “low” as 2 ^ 27 - 1 according to the Haskell Report, in practice it is likely to be 2 ^ 63 - 1 and thus so high that the number of pages will probably never get anywhere close to it. With the current implementation, the last element of higherBufferSizes is at least 2 ^ 62 and thus still so high that it will probably never be needed. I think that overflow of buffer sizes should be prevented even if it is unlikely to ever occur, simply to be safe and for the code to be correct. However, I don’t think we have to introduce the extra, albeit small, complexity that comes from adding maxBound to higherBufferSizes, given that doing so probably wouldn’t make any difference in practice. What are your thoughts on this?

src/Database/LSMTree/Internal/Vector/Growing.hs

Comment on lines +104 to +105

		-> Mutable.grow buffer (bufferSize' - bufferSize) >>=
		writeSTRef bufferRef

Collaborator

jorisdral Oct 7, 2024

Suggested change

      
                                -> Mutable.grow buffer (bufferSize' - bufferSize) >>=
          
                                   writeSTRef bufferRef
          
                                -> Mutable.grow buffer (bufferSize' - bufferSize) >>=
          
                                   (writeSTRef bufferRef $!)

Collaborator Author

jeltsch Oct 15, 2024 •

edited

Loading

I think that ($!) is not needed here, for the same reasons as in the invocation of newSTRef above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

jorisdral jorisdral requested changes

dcoutts Awaiting requested review from dcoutts dcoutts is a code owner

mheinzel Awaiting requested review from mheinzel mheinzel is a code owner

recursion-ninja Awaiting requested review from recursion-ninja recursion-ninja is a code owner

Requested changes must be addressed to merge this pull request.

Labels