
ethdb: introduce vectordb for persistent sequential data storage #19229


Closed
wants to merge 1 commit

Conversation

Contributor

@Matthalp-zz commented Mar 6, 2019

The vectordb provides a database for storing binary blobs with a
sequential relationship to one another. This specifically targets
blockchain data constructs such as block headers, block bodies,
a block's transaction receipts, and block-number-to-hash mappings.

This extracts out the "freezer table" from the "freezer" implementation
in #17814, with a few modifications.

The design decision to keep all data in a single file is retained,
although it will probably make sense to split that file up down the
road for several reasons outside the scope of this description. The
modifications are:

  1. It is not append-only and immutable. A VectorDB can shrink or
    grow so that it can be used for storing all relevant blockchain
    data in the presence of chain reorganizations.
  2. The metadata contained within the index file has been moved
    to a struct so that it is easier to add metadata (if needed);
    see the sketch at the end of this description.

TODO in this PR:

  1. Add in logging.
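
To make modification (2) concrete, here is a rough sketch of the kind of fixed-size index entry this refers to. It is illustrative only: the field layout, sizes, and encoding below are assumptions for discussion, not the exact code in this PR.

```go
// Sketch of a fixed-size index entry, so the index file stays randomly
// addressable. Field names, sizes, and encoding are illustrative only.
package vectordb

import "encoding/binary"

// indexEntry locates one blob inside the single data file.
type indexEntry struct {
	offset uint64 // byte offset of the blob in the data file
	length uint32 // byte length of the blob
}

const indexEntrySize = 12 // 8-byte offset + 4-byte length

// marshal encodes the entry into a fixed-size record.
func (e indexEntry) marshal() []byte {
	buf := make([]byte, indexEntrySize)
	binary.BigEndian.PutUint64(buf[0:8], e.offset)
	binary.BigEndian.PutUint32(buf[8:12], e.length)
	return buf
}

// unmarshal decodes a fixed-size record back into an entry.
func (e *indexEntry) unmarshal(buf []byte) {
	e.offset = binary.BigEndian.Uint64(buf[0:8])
	e.length = binary.BigEndian.Uint32(buf[8:12])
}
```

Keeping every entry the same size keeps the index file randomly addressable; adding a new metadata field later would mean bumping indexEntrySize and rewriting existing index files.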

return err
}

if err := db.checkBounds(len); err != nil {
Contributor

You need to either move this check to after obtaining the lock, or redo it later. As it is, it could blow up if two Truncate calls happen simultaneously.
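
For concreteness, the suggested fix is roughly the following shape, assuming the table guards its state with a sync.RWMutex and that checkBounds reads that state; the type layout and method names are illustrative, not the exact code in this PR:

```go
package vectordb

import (
	"fmt"
	"sync"
)

// VectorDB sketch: only the fields needed to illustrate the locking issue.
type VectorDB struct {
	lock  sync.RWMutex
	items uint64 // number of entries currently stored
}

func (db *VectorDB) checkBounds(n uint64) error {
	if n > db.items {
		return fmt.Errorf("vectordb: %d out of bounds (have %d items)", n, db.items)
	}
	return nil
}

// Truncate validates only after the lock is held, so two concurrent calls
// cannot both pass the bounds check against stale state.
func (db *VectorDB) Truncate(n uint64) error {
	db.lock.Lock()
	defer db.lock.Unlock()

	if err := db.checkBounds(n); err != nil {
		return err
	}
	db.items = n
	// ... shrink the data and index files to n entries ...
	return nil
}
```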

Contributor Author

Thanks for pointing this out. I decided to be a little more conservative and revisited the lock placement in all of the externally exposed methods.

The vectordb provides a database for storing binary blobs with a
sequential relationship to one another. This specifically targets
blockchain data constructs such as block headers, block bodies,
a block's transaction receipts, and block-number-to-hash mappings.

A considerable amount of this code comes from the "freezer"
prototyping work done in ethereum#17814. However, it contains a few
modifications:

1. It is not append-only and immutable. A VectorDB can shrink or
   grow so that it can be used for storing all relevant blockchain
   data in the presence of chain reorganizations.
2. The metadata contained within the index file has been moved
   to a struct so that it is easier to add metadata (if needed).
3. Adds a metadata file for the database instance (to store the
   version).

This extracts the "freezer table" from the "freezer" prototyping
done in ethereum#17814. The design decision to keep all data in a
single file is retained, although it will probably make sense to
split that file up down the road for several reasons outside the
scope of this description.

TODO in this PR:
1. Add the ability to repair a VectorDB instance.
2. Add in logging.
@Matthalp-zz
Contributor Author

@holiman I've also added some unit tests for the repair functionality and will probably add some logging tomorrow. Anything else outstanding? @karalabe WDYT?

// You should have received a copy of the GNU Lesser General Public License
// along with the go-ethereum library. If not, see <http://www.gnu.org/licenses/>.

// Package vectordb provides the vector database implementation.
Member

Nitpick: "vectordb provides the vector database implementation" isn't particularly helpful. Perhaps "Package vectordb is a database for sequentially indexed data".

@karalabe
Member

karalabe commented Mar 8, 2019

Can you explain what the differences between this and my freezer work are? You mentioned my "freezer prototyping", but to be frank, my code was mostly production ready, whereas this PR looks kind of like my code, just with comments deleted and validation checks moved into individual methods.

The only real change is that you changed the index so it's not only an array of offsets, but rather also contains the lengths, which is just redundant data as I can use two consecutive offsets to calculate the length anyway.
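
For reference, the offsets-only scheme recovers the length from two consecutive index entries; a rough illustration (not code from either PR, and it assumes entry i stores the end offset of blob i):

```go
package main

import "fmt"

// blobBounds returns where blob n starts in the data file and how long it
// is, given an index that stores only end offsets. The length never needs
// to be stored explicitly.
func blobBounds(endOffsets []uint64, n int) (start, length uint64) {
	if n > 0 {
		start = endOffsets[n-1]
	}
	return start, endOffsets[n] - start
}

func main() {
	// Three blobs of 100, 250 and 40 bytes laid out back to back.
	ends := []uint64{100, 350, 390}
	start, length := blobBounds(ends, 1)
	fmt.Println(start, length) // 100 250
}
```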

@karalabe
Member

karalabe commented Mar 8, 2019

You mentioned two differences:

It is not append-only and immutable. A VectorDB can shrink or
grow so that it can be used for storing all relevant blockchain
data in the presence of chain reorganizations.

My code was fully capable of doing this. I haven't made the method public, but the functionality was already implemented and actually used after crash recovery.

The metadata contained within the index file has been moved
to a struct so that it is easier to add metadata (if needed).

Adding any metadata would require recreating the entire index, since we need O(1) access, so we need to know exactly where the metadata for block N starts. That said, this seems like over-engineering to me. The goal of the freezer was to make it simple and dumb. This seems to me like an abstraction with no real use other than some possible future feature, but that feature will require recreating everything either way, so we can always make it smarter when needed.
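
To make the O(1)-access point concrete: with fixed-size index entries, the record for item N sits at byte N * entrySize of the index file, so a lookup is a single ReadAt with no scanning. The sketch below reuses the illustrative indexEntry and indexEntrySize names from the sketch in the PR description; it is an illustration under those assumptions, not code from either PR:

```go
package vectordb // continuing the sketch from the PR description above

import "os"

// readIndexEntry seeks straight to the fixed-size record for item n. Any
// variable-size metadata would break this arithmetic, which is why adding
// metadata later means rewriting the whole index file.
func readIndexEntry(index *os.File, n uint64) (indexEntry, error) {
	buf := make([]byte, indexEntrySize)
	if _, err := index.ReadAt(buf, int64(n)*indexEntrySize); err != nil {
		return indexEntry{}, err
	}
	var e indexEntry
	e.unmarshal(buf)
	return e, nil
}
```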

@karalabe
Member

karalabe commented Mar 8, 2019

I've separated out my previous freezer work into #19244, rebased on top of master. If you check that PR, you'll see that I've already done all the integration into Geth, from CLI to background processes, to utility methods. It also has seamless integration with rawdb, hiding the complexity from internal packages without needing to mess with them.

Now, I'm not saying that that work is ready to be merged and that we should not tweak it, extend it, etc. But that PR contains a significant amount of work that you haven't included here yet (the whole integration path).

Also, you've migrated most of my freezer_table work without making it really better, just splitting it into many more methods. I'm not a fan of tiny utility methods to check for error conditions (i.e. a little copy-paste makes the code a lot more readable). However, I'm a bit sensitive when I see my code reproduced from scratch :P.

It's fine to keep the considerations in this PR, many do have merit and how best to integrate them can be debated, but I'd be happier if we put the work into the existing one instead of redoing it from zero.

@Matthalp-zz
Contributor Author

@karalabe and I have discussed this PR offline. There was a misunderstanding about the state of #17814, as it was 28 days old when this new PR was posted and we had not established its direction in our prior conversations. This PR is being closed, but I will address the comments here and the offline discussion to close the loop:

Can you explain what the differences between this and my freezer work are? You mentioned my "freezer prototyping", ... this PR looks kind of like my code, just with comments deleted and validation checks moved into individual methods.

The PR description has always attributed the origin of this code; it has always said something along the lines of "This extracts out the "freezer table" from the "freezer" implementation in #17814, with a few modifications."

The idea was to extract this part out of #17814 to be able to review it in isolation and experiment with alternative approaches to the current freezer implementation. My preference is for more granular methods that eliminate the need for comments, but that's just me. This PR also adds tests to demonstrate that it works.

code was mostly production ready

From what I could tell there were no tests. Additionally, I think there is still some design discussion to be had about the freezer mechanics beyond the freezer table implementation. I am happy to continue that discussion on #19244.

The only real change is that you changed the index so it's not only an array of offsets, but rather also contains the lengths, which is just redundant data as I can use two consecutive offsets to calculate the length anyway.

I agree that this is more or less just a preference. I felt like the indexEntry struct made things a little clearer, and the overhead is negligible compared to what will be stored in the data file.

I also foresee splitting the data file into multiple files to support pruning, so there was a close-to-immediate use case and a follow-up PR to address it. Perhaps it was slightly premature.

Adding any metadata would require recreating the entire index, since we need O(1) access, so we need to know exactly where the metadata for block N starts. That said, this seems over engineering to me.

As mentioned above, I feel there is value in defining the metadata format. Even if the index file only contained the offset internally, I would probably still provide the indexEntry; it would just be generated slightly differently.

@Matthalp-zz closed this Mar 8, 2019