-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FSTCompiler.Builder
should have an option to stream the FST bytes directly to Directory
#12543
Comments
Copying the comment from #10520 that's really about this issue:
Awesome, thanks @dungba88! But, at write time, can we avoid BytesStore entirely, maybe? Just write to a DataOutput? But we would need to first refactor FST#addNode to not use this reverse method on BytesStore, and do its own byte[] buffering/reversal before writing? If possible, can we do this as a first step, which we test/merge before making the change to move to DataOutput? |
Copying another comment from #10520:
Ahh, yes, exactly! (Sorry, catching up and reading comments out-of-order!).
Well, Lucene's DataOutput abstraction can be backed by heap too, e.g. ByteBufferDataOutput or ByteArrayDataOutput (there may be others!). An application could even easily make its own "double the byte[]" to grow custom DataOutput.
I think in the original issue that introduced off-heap FST reading, luceneutil benchmarks were run? Not certain though. I don't know of other benchys. |
@dungba88 raised a good point -- FST construction also needs to read prior bytes it wrote even as it is appending new bytes to the end of the file. Lucene's IndexInput/Output don't support this usage ... so that's a wrinkle we'd need to somehow overcome to have true off-heap writing. |
Thanks @mikemccand ! Let's continue the discuss in this issue instead. |
I put together a PR at #12624. I also verified with a custom dictionary (~1MB in size) that position does not go backward to previously written input, so I think it would not be a problem to create a new |
One of the thing I think is missing is that those byte manipulation methods should not be called after calling |
Copied from the PR: It seems Tantivy segregate the building and the traverse of FST as 2 different entity. The FST Builder will just write the FST to a DataOutput and not allow it to be read directly. I was thinking of this too, as currently we are mixing up the writing and reading:
I'm kind of favoring the way Tantivy is doing, it's cleaner and more "facade pattern". Maybe we could create an experimental FST Builder, that only save the FST into DataOutput? For suffix sharing, Tantivy seems to use LRU cache in write-through mode: write the new node to both the DataOutput and the cache, and read from there. It will not result in a perfectly minimal FST though. |
I put a new revision with support for DataOutput and FileChannel. When using DataOutput, if suffix sharing is enabled one also needs to pass a RandomAccessInput for reading. Otherwise it can be left null. So one can pass a IndexOutput, and RandomAccessInput can be created from IndexInput. When using FileChannel, one only needs to pass the FileChannel as that already allows both read & write at the same time. This FileChannel implementation is just for demonstration of feasibility. Some stuffs I'd like to discuss:
|
* Move size() to FSTStore * Remove size() completely * Allow FST builder to use different DataOutput * access BytesStore byte[] directly for copying * Rename BytesStore * Change class to final * Reorder methods * Remove unused methods * Rename truncate to setPosition() and remove skipBytes() * Simplify the writing operations * Update comment * remove unused parameter * Simplify BytesStore operation * tidy code * Rename copyBytes to writeTo * Simplify BytesStore operations * Embed writeBytes() to FSTCompiler * Fix the write bytes method * Remove the default block bits constant * add assertion * Rename method parameter names * Move reverse to FSTCompiler * Revert setPosition call * Address comments * Return immediately when writing 0 bytes * Add comment & * Rename variables * Fix the compile error * Remove isReadable() * Remove isReadable() * Optimize ReadWriteDataOutput * tidy code * Freeze the DataOutput once finished() * Refactor * freeze the DataOutput before use * Address comments and add off-heap FST tests * Remove the hardcoded random * Ignore the Test2BFSTOffHeap test * Simplify ReadWriteDataOutput * Update Javadoc * Address comments
* Move size() to FSTStore * Remove size() completely * Allow FST builder to use different DataOutput * access BytesStore byte[] directly for copying * Rename BytesStore * Change class to final * Reorder methods * Remove unused methods * Rename truncate to setPosition() and remove skipBytes() * Simplify the writing operations * Update comment * remove unused parameter * Simplify BytesStore operation * tidy code * Rename copyBytes to writeTo * Simplify BytesStore operations * Embed writeBytes() to FSTCompiler * Fix the write bytes method * Remove the default block bits constant * add assertion * Rename method parameter names * Move reverse to FSTCompiler * Revert setPosition call * Address comments * Return immediately when writing 0 bytes * Add comment & * Rename variables * Fix the compile error * Remove isReadable() * Remove isReadable() * Optimize ReadWriteDataOutput * tidy code * Freeze the DataOutput once finished() * Refactor * freeze the DataOutput before use * Address comments and add off-heap FST tests * Remove the hardcoded random * Ignore the Test2BFSTOffHeap test * Simplify ReadWriteDataOutput * Update Javadoc * Address comments
@mikemccand I think this is done. Can we close it. |
Ahh yes I will close it, and attach milestone label. Hmm which 9.x did we release this in ... |
Looks like 9.10.0, from the |
Yeah we released in 9.10. Thank you! |
Description
[Spinoff from this comment inspired by Tantivy's FST implementation]
The building of an FST is inherently streamable: the way the FST freezes states as it processes inputs is a write-once, roughly append-only operation. Today, Lucene holds this growing
byte[]
entirely in RAM, and once done, writes the whole thing to disk.Yet at search time, Lucene searches the FST off-heap, doing nearly random backwards IO through
IndexInput
.Let's fix Lucene to stream the FST
byte[]
directly toIndexOutput
? This would reduce the RAM required to build so that it is constant regardless of how large an FST you are building / how many input/output pairs you are adding to it.The text was updated successfully, but these errors were encountered: