-
Notifications
You must be signed in to change notification settings - Fork 20.8k
ethdb: introduce vectordb for persistent sequential data storage #19229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e200cfb
to
01f8d02
Compare
return err | ||
} | ||
|
||
if err := db.checkBounds(len); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to either move this check to after obtaining the lock
, or redo it later. As it is, it could blow if two Truncate
is called simultaneously
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out. I decided to be a little more conservative and revisited the lock placement in all of the externally exposed methods.
The vectordb provides a database for storing binary blobs with a sequential relationship to one other. This is specifically targetting blockchain data constructs that include block header, block bodies, a block's transaction receipts, and block number to hash mappings. A considerable amount of this code comes from the "freezer" prototyping work done in ethereum#17814. However it does contain a few modifications: 1. It is not append-only and immutable. A VectorDB can shrink or grow so that it can be used for storing all relevant blockchain data in the presence of chain reorganizations. 2. The metadata contained within the index file has been moved to a struct so that it is easier to add metadata (if needed). 3. Adds a metadata file for the database instance (to store the version). This extracts out the "freezer table" from the "freezer" prototyping done in ethereum#17814, with a few modifications. The design decision to keep all data in a single file was kept, although it will probably make sense to break that up down the road for several reasons outside the scope of this description. TODO in this PR: 1. Add the ability to repair a VectorDB instance. 2. Add in logging.
// You should have received a copy of the GNU Lesser General Public License | ||
// along with the go-ethereum library. If not, see <http://www.gnu.org/licenses/>. | ||
|
||
// Package vectordb provides the vector database implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit pick, vectordb provides the vector database implementation
isn't particularly helpful. Perhaps Package vectordb is a database for sequentially indexed data
Can you explain what the differences between this and my freezer work is? You mentioned my "freezer prototyping", but to be frank, my code was mostly production ready, whereas this PR looks kind of my code, just with comments deleted and validation checks moved into individual methods. The only real change is that you changed the index so it's not only an array of offsets, but rather also contains the lengths, which is just redundant data as I can use two consecutive offsets to calculate the length anyway. |
You mentioned two differences:
My code was fully capable of doing this. I haven't made the method public, but the functionality was there already implemented and actually used after crash recovery.
Adding any metadata would require recreating the entire index, since we need O(1) access, so we need to know exactly where the metadata for block N starts. That said, this seems over engineering to me. The goal of the freezer was to make it simple and dumb. This seems to me like an abstraction with no real use other than some future feature maybe eventually, but that will require recreating everything either way, so we can always make it smarter when needed. |
I've separated out my previous freezer work into #19244, rebased on top of master. If you check that PR, you'll see that I've already done all the integration into Geth, from CLI to background processes, to utility methods. It also has seamless integration with rawdb, hiding the complexity from internal packages without needing to mess with them. Now, I'm not saying that that work is ready to be merged and that we should not tweak it, extend it, etc. But that PR contains a significant amount of work that you haven't included in here yet (the whole integration path). Also you've migrated most of my freezer_table work without making it really better, just split it into many more methods. I'm not a fan of tiny utility methods to check for error conditions (i.e. a little copy paste makes the code a lot more readable). However, I'm a bit sensitive when I see my code reproduced from scratch :P. It's fine to keep the considerations in this PR, many does have merit and can be debated on how to integrate best, but I'd be happier if we put the work into the existing one instead of redoing it from 0. |
@karalabe and I have discussed this PR offline. There was a misunderstanding about the state of #17814 as it was 28 days old when this new PR was posted and we had not established its direction in our prior conversations. This PR is being closed, but will address the comments here and offline discussion to close the loop:
The PR description has always attributed the origin of this code as the PR description has always said something along time lines of "This extracts out the "freezer table" from the "freezer" implementation The idea was to extract this part out of #17814 to be able to review it in isolation and experiment with alternative approaches to the current freezer implementation. My preference is more granular methods that eliminate the need for comments. But that's just me. It also adds tests to demonstrate it works.
From what I could tell there were no tests. Additionally, I think there is still some design discussion to be had about the freezer mechanics beyond the freezer table implementation. I am happy to continue that discussion on #19244.
I agree that this is more or less just a preference. I felt like the I also foresee splitting the data file into multiple files to support pruning so there was a close-to-immediate use case and follow up PR to address that. Perhaps it was slightly premature.
As mentioned above, I feel like there is value to defining the metadata format. Even if the index file only contained the |
The vectordb provides a database for storing binary blobs with a
sequential relationship to one other. This is specifically targetting
blockchain data constructs that include block header, block bodies,
a block's transaction receipts, and block number to hash mappings.
This extracts out the "freezer table" from the "freezer" implementation
in #17814, with a few modifications.
The design decision to keep all data in a single file was kept, although
it will probably make sense to break that up down the road for several
reasons outside the scope of this description. However it does contain
a few modifications:
grow so that it can be used for storing all relevant blockchain
data in the presence of chain reorganizations.
to a struct so that it is easier to add metadata (if needed).
TODO in this PR: